Temporal Global Correlation Network for End-to-End Action Proposal Generation

MA Bai-teng; ZHANG Shi-wei; GAO Chang-xin; SANG Nong

doi:10.12263/DZXB.20201302

Academic paper | Updated: 2025-12-08

- Temporal Global Correlation Network for End-to-End Action Proposal Generation
- Acta Electronica Sinica, 2022, Vol. 50, No. 10, pp. 2452-2461
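- Author affiliations:
  
  1. Key Laboratory of Image Processing and Intelligent Control, Ministry of Education, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, Hubei 430000
  2. Alibaba DAMO Academy Technology Co., Ltd., Hangzhou, Zhejiang 310000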
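- About the authors:
  
  MA Bai-teng, male, born in 1995. Master's student at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. His main research interests are video processing, action detection, computer vision, and pattern recognition. E-mail: btm@hust.edu.cn
  ZHANG Shi-wei, male. Senior algorithm engineer at Alibaba DAMO Academy (Hangzhou) Technology Co., Ltd. His main research interests are action detection, computer vision, and pattern recognition. E-mail: zhangjin.zsw@alibaba-inc.com
  GAO Chang-xin, male. Associate professor at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. His main research interests are computer vision, pattern recognition, and intelligent video analysis. E-mail: cgao@hust.edu.cn
  SANG Nong (corresponding author), male. Professor at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. His main research interests are computer vision and pattern recognition.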
- Funding:
  
  National Natural Science Foundation of China (61871435)
- DOI: 10.12263/DZXB.20201302
  CLC number: TP391
- Received: 2020-11-18
  
  Revised: 2021-03-01
  
  Published in print: 2022-10-25

MA Bai-teng, ZHANG Shi-wei, GAO Chang-xin, et al. Temporal Global Correlation Network for End-to-End Action Proposal Generation[J]. Acta Electronica Sinica, 2022, 50(10): 2452-2461. DOI: 10.12263/DZXB.20201302.

Abstract

Temporal action proposal generation aims to locate the start and end times of actions in untrimmed videos. Existing methods are suboptimal for two reasons: the features they use do not encode sufficient temporal global information, which may result in imprecise proposals; and feature extraction and proposal generation are performed separately, so the features may not be fully suited to the proposal generation task. To address these problems, we propose the temporal global correlation network (TGCNet), which repeatedly embeds a well-designed temporal global correlation (TGC) module to encode temporal global information. Specifically, the TGC module mainly contains a dynamic correlation structure and a static correlation structure, which encode dynamic and static global information, respectively. Most importantly, TGCNet can be trained in an end-to-end manner, which makes the features learned by TGCNet more suitable for action proposal generation. We perform experiments on two challenging datasets, THUMOS14 and ActivityNet1.3, and the results show that the proposed TGCNet achieves state-of-the-art temporal action proposal generation performance on both datasets.
