Enhanced Spatial-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

JIANG Wei; GUAN Meng-yi; WEI Fu-peng; SUN Hao-chen; MENG Yao; WU Hui-xin

doi:10.12263/DZXB.20250259

PAPERS | Updated: 2026-02-05

- ACTA ELECTRONICA SINICA Vol. 53, Issue 10, Pages: 3692-3704 (2025)
- Affiliation: School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou, Henan 450000, China
- Funding: National Natural Science Foundation of China (42371466); Key Research Projects of Henan Higher Education Institutions (23A520031; 24A520020; 25B520012); Science and Technology Plan Project of Housing and Urban-Rural Development in Henan Province (HNJS-2024-K35)
- DOI: 10.12263/DZXB.20250259
  CLC: TP391
- Received: 06 April 2025; Accepted: 09 October 2025; Published: 25 October 2025

JIANG Wei, GUAN Meng-yi, WEI Fu-peng, et al. Enhanced Spatial-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition[J]. Acta Electronica Sinica, 2025, 53(10): 3692-3704. DOI: 10.12263/DZXB.20250259

Abstract

Graph convolutional networks (GCNs) have been widely applied to skeleton-based action recognition and have achieved remarkable performance. However, as the number of action categories and the complexity of scenes increase, existing methods still face significant challenges in modeling fine-grained human body structure and temporal dependencies, which can be summarized as two main issues. First, when extracting relational features among joints, these methods often fail to adequately capture the interactions among peripheral joints (the hands, feet, and head) and the synergy between peripheral joints and the other joints. Second, when extracting temporal features, these methods are limited to short-term features and fail to capture long-term temporal dependencies. To address these issues, this paper proposes an enhanced spatial-temporal graph convolutional network (EST-GCN), built by stacking multi-branch spatial enhanced graph convolution (MSEGC) modules and multi-scale temporal enhanced convolution (MTEC) modules. MSEGC learns and propagates features from a two-stream graph convolution over multiple stages, which strengthens the feature representation of peripheral joints and thereby captures the relationships between peripheral joints and the other joints. MTEC learns and propagates temporal features from multi-scale temporal convolutions over multiple stages, which widens the temporal span and thereby captures broader temporal dependencies across frames. The model extracts and fuses spatial and temporal features by passing the input through MSEGC and then MTEC, jointly modeling structural correlations among joints and temporal dependencies to improve the discriminability of the spatial-temporal features. To fully exploit the spatial-temporal information in skeleton data, three types of input features (joint positions, motion velocities, and bones) are introduced and combined through multi-stream fusion to enhance the feature representation. The proposed method achieves accuracies of 92.4% and 96.2% on the X-Sub and X-View benchmarks of the NTU-RGB+D dataset, respectively, and 88.7% and 90.0% on the X-Sub and X-Setup benchmarks of the NTU-RGB+D 120 dataset, respectively, which demonstrates its effectiveness. Furthermore, to validate the model's performance on human action recognition in real-world scenarios, skeleton-based action recognition experiments are conducted on video samples from the NTU-RGB+D dataset, with additional tests under multi-person interaction and joint noise interference. The results show that the model still achieves accurate recognition even when some joints are misassigned, which verifies the practicality and robustness of the proposed method.
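The three input streams named in the abstract (joint positions, motion velocities, and bones) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the common convention in skeleton-based recognition that velocity is the frame-to-frame difference of joint positions and that a bone is the vector from a parent joint to its child. The five-joint skeleton `TOY_BONES`, the function names `build_streams` and `fuse_scores`, and the use of a weighted sum of per-stream class scores as the fusion step are all hypothetical choices made for illustration; the abstract does not specify how the streams are fused.

```python
import numpy as np

# Hypothetical toy skeleton with 5 joints, given as (child, parent) index pairs.
# NTU-RGB+D skeletons have 25 joints; this small tree is only for illustration.
TOY_BONES = [(1, 0), (2, 1), (3, 1), (4, 0)]


def build_streams(joints, bones=TOY_BONES):
    """Derive joint, velocity, and bone streams from a skeleton sequence.

    joints: array of shape (T, V, C) with T frames, V joints, C coordinates.
    Returns a dict of three arrays, each of shape (T, V, C).
    """
    joints = np.asarray(joints, dtype=np.float64)

    # Velocity (assumed definition): difference between consecutive frames.
    # The last frame has no successor, so its velocity is left at zero.
    velocity = np.zeros_like(joints)
    velocity[:-1] = joints[1:] - joints[:-1]

    # Bone (assumed definition): vector from the parent joint to the child.
    # The root joint has no parent, so its bone vector stays zero.
    bone = np.zeros_like(joints)
    for child, parent in bones:
        bone[:, child] = joints[:, child] - joints[:, parent]

    return {"joint": joints, "velocity": velocity, "bone": bone}


def fuse_scores(stream_scores, weights=None):
    """Weighted sum of per-stream class scores (an assumed late-fusion scheme)."""
    names = sorted(stream_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    return sum(
        weights[name] * np.asarray(stream_scores[name], dtype=np.float64)
        for name in names
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(4, 5, 3))  # 4 frames, 5 joints, 3D coordinates
    streams = build_streams(sequence)
    for name, values in streams.items():
        print(name, values.shape)
```

In a multi-stream setup of this kind, each stream would typically be fed to its own copy of the network, and `fuse_scores` would combine the resulting class scores before taking the arg-max.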
