Bimodal Action Recognition Based on Spatiotemporal Adaptive Fusion

QING Yu-han; GAO Chen-qiang; TAN Zhuo-lin; LIU Fang-cen

doi:10.12263/DZXB.20250026


PAPERS | Updated: 2025-12-10

- Acta Electronica Sinica, Vol. 53, Issue 7, Pages 2389-2400 (2025)
- Author affiliations:
  1. School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065
  2. School of Intelligent Systems Engineering, Sun Yat-sen University (Shenzhen), Shenzhen, Guangdong 518107
- Funding: National Natural Science Foundation of China (62176035); Shenzhen Fundamental Research Program (JCYJ20240813151216022)
- CLC: TP39
- Received: 07 January 2025; Revised: 09 June 2025; Published: 25 July 2025

QING Yu-han, GAO Chen-qiang, TAN Zhuo-lin, et al. Bimodal Action Recognition Based on Spatiotemporal Adaptive Fusion[J]. Acta Electronica Sinica, 2025, 53(07): 2389-2400. DOI: 10.12263/DZXB.20250026

Abstract

Bimodal action recognition aims to improve recognition performance in complex scenes by learning complementary information across data modalities, compensating for the limitations of any single modality. Existing methods typically extract features from each modality with independent backbone networks and then fuse them. However, they do not adequately account for semantic discrepancies between modalities, such as feature misalignment, and they struggle to handle modality occlusion, so fusion tends to introduce interference that degrades recognition accuracy. To address these issues, this paper proposes a bimodal action recognition method based on spatiotemporal adaptive fusion. Specifically, a temporal keyframe selection module highlights informative frames through a competitive mechanism. In parallel, a spatial salient region selection module adaptively selects the effective feature regions across modalities, suppressing irrelevant information and guiding the network to learn action-related spatiotemporal features efficiently. In addition, a self-distillation mechanism combines a prediction distribution loss with a region distillation loss to focus the network on key action regions. To further improve the fusion of bimodal features, an adaptive mask fusion module applies learnable masks within the multi-head self-attention and multi-layer perceptron computations to filter out invalid regions and reduce their negative effect on fusion. Compared with the baseline, the proposed method improves Top-1 accuracy by 3.75% on the InfRA dataset and by 3.49% on the NTU RGB+D dataset, which indicates that the network adaptively selects and fuses bimodal features and improves action recognition performance.
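The abstract does not specify how the mask enters the attention computation, so the following is only a minimal sketch of the general idea of masking uninformative regions inside self-attention. Everything in it is an assumption made for illustration: the function names `masked_softmax` and `masked_self_attention` are hypothetical, the mask is a fixed list of booleans rather than a learned one, there is a single head, and the query, key, and value projections are taken to be the identity.

```python
import math


def masked_softmax(scores, keep):
    """Softmax over `scores`; positions with keep[i] == False get zero weight."""
    kept = [s for s, k in zip(scores, keep) if k]
    if not kept:
        return [0.0] * len(scores)
    m = max(kept)  # subtract the max of the kept scores for numerical stability
    exps = [math.exp(s - m) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(tokens, keep):
    """Single-head self-attention with identity Q/K/V projections (assumed).

    tokens: list of equal-length feature vectors, one per region
            (regions from both modalities are assumed to be concatenated).
    keep:   list of bools; False marks a region treated as uninformative.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query, query_kept in zip(tokens, keep):
        if not query_kept:
            fused.append([0.0] * dim)  # a masked region produces no output
            continue
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights = masked_softmax(scores, keep)  # masked keys receive weight 0
        fused.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


# Hypothetical toy input: three regions, the third one marked as uninformative.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
keep = [True, True, False]
fused = masked_self_attention(tokens, keep)
```

In this toy setup the third region contributes nothing to the fused features of the first two, whatever its values are. That is the behaviour the abstract attributes to masking invalid regions, although the paper's actual module may realise it differently.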


