Real-Time Action Detection Based on Spatio-Temporal Interaction Perception

KE Xiao; MIAO Xin; GUO Wen-zhong


Updated: 2023-08-29

- ACTA ELECTRONICA SINICA Pages: 1-15(2023)
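- Author affiliations:
  
  1. College of Computer and Data Science, Fuzhou University, Fuzhou, Fujian 350116, China
  2. Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing (Fuzhou University), Fuzhou, Fujian 350116, China
  3. Key Laboratory of Spatial Data Mining and Information Sharing, Ministry of Education, Fuzhou, Fujian 350003, China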
- Funding:
  
  National Natural Science Foundation of China(61972097);National Key Research and Development Plan of China(U21A20472);Major Science and Technology Project of Fujian Province(2021YFB3600503);Natural Science Foundation of Fujian Province(2021HZ022007)
- CLC: TP391
- Received: 18 July 2022
- Published online: 29 August 2023
KE Xiao, MIAO Xin, GUO Wen-zhong. Real-Time Action Detection Based on Spatio-Temporal Interaction Perception[J/OL]. ACTA ELECTRONICA SINICA, 2023, 1-15.

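Abstract (translated from the Chinese)

Spatio-temporal action detection depends on learning both the spatial and the temporal information in video. The current state-of-the-art action detectors based on convolutional neural networks adopt 2D CNN or 3D CNN architectures and have achieved remarkable results. However, because of the complexity of the network structures and of spatio-temporal information perception, these methods usually run in a non-real-time, offline manner. The main challenge of spatio-temporal action detection lies in designing an efficient detection network architecture that can effectively perceive and fuse spatio-temporal features. In view of these problems, this paper proposes a real-time action detection method based on spatio-temporal cross perception. The method first shuffles and reorders the input video to enhance temporal information. Because a 2D or a 3D backbone alone cannot model spatio-temporal features effectively, a multi-branch feature extraction network based on spatio-temporal cross perception is proposed. Because single-scale spatio-temporal features are insufficiently descriptive, a multi-scale attention network is proposed to learn long-term temporal dependencies and spatial context. For the fusion of temporal and spatial features, which come from two different sources, a new motion-saliency-enhanced fusion strategy is proposed; it encodes and cross-maps the spatio-temporal information, guides the fusion of temporal and spatial features, and highlights more discriminative spatio-temporal feature representations. Finally, action association links are computed online from the frame-level detector results. The proposed method reaches accuracies of 84.71% and 78.4% on the two spatio-temporal action datasets UCF101-24 and JHMDB-21, respectively, outperforming existing state-of-the-art methods, at a speed of 73 frames per second. In addition, to address the high inter-class similarity and easily confused hard samples in JHMDB-21, this paper proposes a key-frame optical flow action detection method based on action representation, which avoids generating redundant optical flow and further improves detection accuracy.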

Abstract

Spatiotemporal action detection requires incorporating both the spatial and the temporal information in video. Current state-of-the-art approaches usually use a 2D CNN or a 3D CNN architecture. However, due to the complexity of the network structure and of spatiotemporal information extraction, these methods are usually non-real-time and offline. To solve this problem, this paper proposes a real-time action detection method based on spatiotemporal interaction perception. First, the frames of the input video are shuffled and reordered to enhance the temporal information. Because a 2D or a 3D backbone network alone cannot model spatiotemporal features effectively, a multi-branch feature extraction network is proposed to extract features from different sources, and a multi-scale attention network is proposed to capture long-term temporal dependencies and spatial context. Then, to fuse the temporal and spatial features, which come from two different sources, a new motion saliency enhancement fusion strategy is proposed; it encodes the temporal and spatial features to guide their fusion and to highlight more discriminative spatiotemporal features. Finally, action tube links are generated online from the frame-level detector results. The proposed method achieves accuracies of 84.71% and 78.4% on the two spatiotemporal action datasets UCF101-24 and JHMDB-21, respectively, outperforming state-of-the-art methods, and runs at 73 frames per second. In addition, to address the high inter-class similarity and the easily confused hard samples in the JHMDB-21 dataset, this paper proposes a key-frame optical flow action detection method based on action representation, which avoids generating redundant optical flow and further improves detection accuracy.
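The abstract says that action tube links are generated online from frame-level detections but does not give the linking rule. The following is only a minimal sketch of one common way to do this, greedy per-class IoU matching between consecutive frames; it is not the authors' algorithm. The names `Tube`, `link_frame` and `iou`, the `(box, score, label)` detection layout, and the 0.5 IoU threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), hypothetical layout
Detection = Tuple[Box, float, int]        # (box, score, class label), hypothetical


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    """One action tube: a class label plus (frame index, box, score) entries."""
    label: int
    boxes: List[Tuple[int, Box, float]] = field(default_factory=list)

    @property
    def score(self) -> float:
        # Mean detection score over the tube so far.
        return sum(s for _, _, s in self.boxes) / len(self.boxes)


def link_frame(tubes: List[Tube], frame_idx: int,
               detections: List[Detection], iou_thresh: float = 0.5) -> List[Tube]:
    """Extend tubes with the detections of one new frame (mutates `tubes`).

    Assumed rule: tubes are visited from highest to lowest score; each tube
    that was alive in the previous frame takes the unused same-class detection
    with the highest IoU, provided it reaches `iou_thresh`. Detections left
    over start new tubes. Only past frames are used, so linking is online.
    """
    used = set()
    for tube in sorted(tubes, key=lambda t: t.score, reverse=True):
        last_frame, last_box, _ = tube.boxes[-1]
        if last_frame != frame_idx - 1:
            continue  # tube already ended; this sketch does not bridge gaps
        best, best_iou = None, iou_thresh
        for i, (box, _, label) in enumerate(detections):
            if i in used or label != tube.label:
                continue
            overlap = iou(last_box, box)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            used.add(best)
            box, score, _ = detections[best]
            tube.boxes.append((frame_idx, box, score))
    for i, (box, score, label) in enumerate(detections):
        if i not in used:
            tubes.append(Tube(label, [(frame_idx, box, score)]))
    return tubes
```

Calling `link_frame` once per incoming frame keeps the per-frame cost proportional to the number of active tubes times the number of detections, which is why rules of this kind suit real-time use. A practical system would usually also tolerate short detection gaps and trim low-scoring tube ends, neither of which this sketch attempts.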


