Real-Time Action Detection Based on Spatio-Temporal Interaction Perception

KE Xiao; MIAO Xin; GUO Wen-zhong

doi:10.12263/DZXB.20220859

Research article | Updated: 2025-12-11

- Chinese title: 基于时空交叉感知的实时动作检测方法
- English title: Real-Time Action Detection Based on Spatio-Temporal Interaction Perception
- Acta Electronica Sinica (电子学报), 2024, Vol. 52, No. 2, pp. 574-588
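- Author affiliations:
  1. College of Computer and Data Science, Fuzhou University, Fuzhou, Fujian 350116, China
  2. Fujian Key Laboratory of Network Computing and Intelligent Information Processing (Fuzhou University), Fuzhou, Fujian 350116, China
  3. Key Laboratory of Spatial Data Mining and Information Sharing, Ministry of Education, Fuzhou, Fujian 350003, China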
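- About the authors:
  KE Xiao, male, born in 1983 in Fuzhou, Fujian. Ph.D., professor and doctoral supervisor at Fuzhou University. His main research interests are computer vision and pattern recognition. E-mail: kex@fzu.edu.cn
  MIAO Xin, female, born in 1997 in Fu'an, Fujian. Master's student at the College of Computer and Data Science, Fuzhou University. Her main research interests are computer vision and action recognition. E-mail: 200320077@fzu.edu.cn
  GUO Wen-zhong, male, born in 1979 in Hui'an, Fujian. Ph.D., professor and doctoral supervisor at Fuzhou University. His main research interests are computational intelligence and its applications. E-mail: guowenzhong@fzu.edu.cn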
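- Funding:
  National Natural Science Foundation of China (61972097; U21A20472); National Key Research and Development Program of China (2021YFB3600503); Major Science and Technology Project of Fujian Province (2021HZ022007); Natural Science Foundation of Fujian Province (2021J01612; 2020J01494)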
- DOI: 10.12263/DZXB.20220859
- Chinese Library Classification (CLC) number: TP391
- Received: 2022-07-20; revised: 2023-02-28; published in print: 2024-02-25
Cite as: KE Xiao, MIAO Xin, GUO Wen-zhong. Real-Time Action Detection Based on Spatio-Temporal Interaction Perception[J]. Acta Electronica Sinica, 2024, 52(02): 574-588. DOI: 10.12263/DZXB.20220859

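Abstract (translated from the Chinese)

Spatio-temporal action detection depends on learning both the spatial and the temporal information in video. At present, state-of-the-art action detectors based on convolutional neural networks (CNNs) adopt 2D CNN or 3D CNN architectures and achieve remarkable results. However, owing to the complexity of the network structure and of spatio-temporal information perception, these methods usually run in a non-real-time, offline manner. The main challenge in spatio-temporal action detection is to design an efficient detection architecture that can also perceive and fuse spatio-temporal features effectively. To address these problems, this paper proposes a real-time action detection method based on spatio-temporal interaction perception. The method first rearranges the input video out of order to enhance temporal information. Because a 2D or 3D backbone used alone cannot model spatio-temporal features effectively, a multi-branch feature extraction network based on spatio-temporal interaction perception is proposed. Because single-scale spatio-temporal features are insufficiently descriptive, a multi-scale attention network is proposed to learn long-term temporal dependencies and spatial context. For fusing features from the two different sources, temporal and spatial, a new motion saliency enhancement fusion strategy is proposed; it encodes and cross-maps the spatio-temporal information to guide the fusion of temporal and spatial features and to highlight more discriminative spatio-temporal representations. Finally, action association links are computed online from the frame-level detector results. The proposed method reaches accuracies of 84.71% and 78.4% on the two spatio-temporal action datasets UCF101-24 and JHMDB-21, respectively, outperforming existing state-of-the-art methods, at a speed of 73 frames per second. In addition, to address the high inter-class similarity and easily confused hard samples in JHMDB-21, this paper proposes a key-frame optical flow action detection method based on action representation, which avoids generating redundant optical flow and further improves detection accuracy.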

Abstract

Spatiotemporal action detection requires incorporating both the spatial and the temporal information in video. Current state-of-the-art approaches usually use a 2D CNN (Convolutional Neural Network) or a 3D CNN architecture. However, due to the complexity of the network structure and of spatiotemporal information extraction, these methods are usually non-real-time and offline. To solve this problem, this paper proposes a real-time action detection method based on spatiotemporal interaction perception. First, the input video is rearranged out of order to enhance the temporal information. Because 2D or 3D backbone networks alone cannot model spatiotemporal features effectively, a multi-branch feature extraction network is proposed to extract features from different sources, and a multi-scale attention network is proposed to extract long-term temporal dependencies and spatial context information. Then, for the fusion of temporal and spatial features from two different sources, a new motion saliency enhancement fusion strategy is proposed, which guides the fusion by encoding temporal and spatial features to highlight more discriminative spatiotemporal features. Finally, action tube links are generated online from the frame-level detector results. The proposed method achieves accuracies of 84.71% and 78.4% on the two spatiotemporal action datasets UCF101-24 and JHMDB-21, respectively, outperforming state-of-the-art methods, and runs at 73 frames per second. In addition, to address the high inter-class similarity and the easily confused hard samples in the JHMDB-21 dataset, this paper proposes a key-frame optical flow action detection method based on action representation, which avoids generating redundant optical flow and further improves detection accuracy.
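The abstract's final step, generating action tube links online from frame-level detections, can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic greedy IoU-based linker of the kind commonly used for online tube building, and every name in it (`Tube`, `link_frame`, the `iou_thresh` value of 0.3, the toy detections) is a hypothetical choice made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float, int]       # (box, score, class label)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    """A sequence of same-class boxes linked across consecutive frames."""
    label: int
    boxes: List[Box] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)
    last_frame: int = -1


def link_frame(tubes: List[Tube], frame_idx: int,
               detections: List[Detection],
               iou_thresh: float = 0.3) -> List[Tube]:
    """Greedily extend tubes with one frame's detections (assumed scheme).

    Tubes are visited from highest to lowest mean score. Each tube that
    was active in the previous frame takes the unmatched same-class
    detection with the largest IoU, if that IoU reaches iou_thresh.
    Detections left unmatched start new tubes.
    """
    unmatched = list(range(len(detections)))
    ordered = sorted(tubes, key=lambda t: -sum(t.scores) / len(t.scores))
    for tube in ordered:
        if tube.last_frame != frame_idx - 1:
            continue  # tube has ended; this sketch links adjacent frames only
        best, best_iou = None, iou_thresh
        for i in unmatched:
            box, _, label = detections[i]
            if label != tube.label:
                continue
            overlap = iou(tube.boxes[-1], box)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            box, score, _ = detections[best]
            tube.boxes.append(box)
            tube.scores.append(score)
            tube.last_frame = frame_idx
            unmatched.remove(best)
    for i in unmatched:
        box, score, label = detections[i]
        tubes.append(Tube(label, [box], [score], frame_idx))
    return tubes


# Toy per-frame detections (made up): one actor from frame 0, a second from frame 1.
frames: List[List[Detection]] = [
    [((10, 10, 50, 50), 0.9, 0)],
    [((12, 11, 52, 51), 0.8, 0), ((200, 200, 240, 240), 0.7, 1)],
    [((14, 12, 54, 52), 0.85, 0), ((202, 201, 242, 241), 0.6, 1)],
]
tubes: List[Tube] = []
for t, dets in enumerate(frames):
    link_frame(tubes, t, dets)  # processes frames one at a time, i.e. online
for tube in tubes:
    print(tube.label, len(tube.boxes))
```

Because each call looks only at the previous frame's tube ends and the current detections, the linker never needs future frames, which is what makes this style of linking usable online. A practical system would also handle missed detections and tube termination, which this sketch omits.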

