MSGPose: Monocular Single-Person 3D Human Pose Estimation Via Multi-Semantic Graph Convolution and Graph-Guided State Space Models

LI Jun; LI Yu; CHEN Li

doi:10.12263/DZXB.20250860

PAPERS | Updated: 2026-06-16

- ACTA ELECTRONICA SINICA Vol. 54, Issue 3, Pages: 1118-1131 (2026)
- Author affiliations:
  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065
  2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, Hubei 430065
- Funding: National Natural Science Foundation of China (62271359)
- CLC: TP391.4
- Received: 28 September 2025; Accepted: 28 February 2026; Published: 25 March 2026

LI Jun, LI Yu, CHEN Li. MSGPose: Monocular Single-Person 3D Human Pose Estimation Via Multi-Semantic Graph Convolution and Graph-Guided State Space Models[J]. Acta Electronica Sinica, 2026, 54(03): 1118-1131. DOI: 10.12263/DZXB.20250860

Abstract

Monocular single-person 3D human pose estimation is valuable for action recognition and human-computer interaction. However, the lack of depth cues in monocular setups introduces inherent depth ambiguity, while self-occlusion and imaging noise further complicate the task, so robustly recovering 3D poses from erroneous 2D observations remains a major challenge. Existing graph convolutional network (GCN) methods often abstract the human skeleton into a predefined physical graph and rely mainly on this fixed topology. Consequently, they fail to express non-physical semantic relations such as bilateral symmetry. Self-attention-based methods, in contrast, excel at long-range modeling, yet their quadratic complexity on long sequences increases computational cost and parameter overhead. To address these limitations, we propose MSGPose, a dual-stream parallel framework for parameter-efficient monocular 3D human pose estimation. It integrates a multi-semantic dynamic separable graph convolution (MSDG) module and a semantic graph-guided Mamba (SGM) module to jointly model and extract spatial and temporal features from 2D pose sequences. For spatial relations, the MSDG module constructs dynamic multi-level semantic graphs from three priors: self-connections, physical connections, and anatomical symmetry. It assigns independent learnable weights to each semantic branch, which helps alleviate semantic feature coupling, and a weighted modification matrix dynamically mitigates the inductive bias of fixed topologies. For temporal dynamics, MSDG employs a sparse dynamic temporal graph convolution whose graph is built with a K-nearest neighbors (K-NN) strategy based on feature similarity, enabling the modeling of cross-joint and inter-frame dependencies during complex movements. The SGM module addresses the spatial topology limitations of standard Mamba architectures, in which flattening spatial tokens into 1D sequences may disrupt the natural topology of the human skeleton. To remedy this, SGM introduces a multi-semantic graph convolution guidance mechanism that operates before the causal 1D convolution and bidirectional state space scanning. This step explicitly injects anatomical structure priors into the sequence representation and provides the subsequent state space model with a geometry-aware feature space, enabling efficient modeling of long-range spatio-temporal dependencies with linear complexity. During the feature fusion stage, learnable adaptive weights complementarily integrate the outputs of the two streams. The framework is trained jointly with 3D position and velocity losses; the velocity loss constrains differences between adjacent frames, which improves the temporal consistency and stability of the predicted poses. Extensive experiments demonstrate the effectiveness of MSGPose. On the Human3.6M dataset, it achieves a mean per joint position error (MPJPE) of 38.9 mm, a 0.3 mm improvement over MotionBERT, while using only 13.3M parameters (approximately 31% of MotionBERT). On the challenging MPI-INF-3DHP dataset, MSGPose demonstrates strong generalization ability, achieving an MPJPE of 14.5 mm, a 1.7 mm improvement over MotionAGFormer. With noise-free ground-truth 2D annotations as input, the MPJPE on Human3.6M drops to 12.7 mm. These results indicate that combining multi-semantic priors with a dual-stream architecture improves 2D-to-3D pose regression.
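To make the three graph priors named in the abstract concrete, the following minimal Python sketch builds self-connection, physical, and symmetry adjacency matrices and sums the per-branch outputs. It is an illustration under stated assumptions, not the authors' implementation: the 17-joint index layout, the `BONES` and `SYMMETRIC_PAIRS` lists, every function name, and the per-branch form `(A_k + M_k) X W_k` are hypothetical choices, and the adjacency matrices are left unnormalized for readability.

```python
# Hypothetical 17-joint layout often used with Human3.6M
# (0 pelvis, 1-3 right leg, 4-6 left leg, 7-10 spine to head,
#  11-13 left arm, 14-16 right arm). The abstract does not specify it.
NUM_JOINTS = 17
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8),
         (8, 9), (9, 10), (8, 11), (11, 12), (12, 13), (8, 14), (14, 15),
         (15, 16)]
SYMMETRIC_PAIRS = [(1, 4), (2, 5), (3, 6), (11, 14), (12, 15), (13, 16)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def adjacency(pairs, n=NUM_JOINTS):
    """Symmetric 0/1 adjacency matrix for undirected joint pairs."""
    a = zeros(n, n)
    for i, j in pairs:
        a[i][j] = a[j][i] = 1.0
    return a


def semantic_graphs(n=NUM_JOINTS):
    """One adjacency matrix per semantic prior."""
    return {
        "self": adjacency([(i, i) for i in range(n)], n),
        "physical": adjacency(BONES, n),
        "symmetry": adjacency(SYMMETRIC_PAIRS, n),
    }


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def multi_semantic_gcn(x, graphs, weights, modifications):
    """Sum over branches of (A_k + M_k) @ x @ W_k.

    x: joints x channels features; weights[k]: a separate projection per
    branch; modifications[k]: a correction added to the fixed graph
    (plain numbers here, since nothing is trained in this sketch).
    """
    out = None
    for name, adj in graphs.items():
        aggregated = matmul(add(adj, modifications[name]), x)
        branch = matmul(aggregated, weights[name])
        out = branch if out is None else add(out, branch)
    return out


graphs = semantic_graphs()
x = [[float(j)] for j in range(NUM_JOINTS)]       # one toy channel per joint
weights = {name: [[1.0]] for name in graphs}      # identity projections
modifications = {name: zeros(NUM_JOINTS, NUM_JOINTS) for name in graphs}
out = multi_semantic_gcn(x, graphs, weights, modifications)
print(out[1])  # right hip: itself + bone neighbours (0, 2) + mirrored joint 4
```

Keeping one projection per branch lets the symmetry branch learn a different mapping from the bone branch, which is one way to read the decoupling the abstract describes; in a trained model the modification matrices would be learnable parameters rather than zeros.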
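The K-NN construction of the sparse temporal graph can be sketched in the same spirit. The abstract only says that neighbors are chosen by feature similarity, so the use of cosine similarity, the treatment of each frame as one node, and the names `cosine` and `knn_graph` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(features, k):
    """For each node, return the indices of its k most similar other nodes."""
    neighbours = []
    for i, fi in enumerate(features):
        scored = [(cosine(fi, fj), j)
                  for j, fj in enumerate(features) if j != i]
        # Highest similarity first; ties broken by the smaller index.
        scored.sort(key=lambda item: (-item[0], item[1]))
        neighbours.append([j for _, j in scored[:k]])
    return neighbours


# Four toy frame features: frames 0 and 1 point one way, 2 and 3 another.
frame_features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
edges = knn_graph(frame_features, k=1)
print(edges)
```

Because each node keeps only `k` edges, the resulting graph stays sparse however long the sequence is, and the edges change with the features, which is what makes the graph dynamic rather than fixed.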
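The MPJPE metric and the velocity term mentioned in the abstract can be written down directly. The sketch below uses the standard definition of MPJPE (mean Euclidean distance over joints and frames) and assumes the velocity loss is the same distance applied to adjacent-frame differences; the `velocity_weight` parameter and all function names are hypothetical, since the abstract does not state how the two terms are balanced.

```python
import math
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]
Frame = Sequence[Joint]


def mpjpe(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean Euclidean distance over every joint of every frame."""
    distances = [math.dist(p, t)
                 for pred_frame, target_frame in zip(pred, target)
                 for p, t in zip(pred_frame, target_frame)]
    return sum(distances) / len(distances)


def frame_differences(seq: Sequence[Frame]) -> List[List[Joint]]:
    """Per-joint displacement between each pair of adjacent frames."""
    return [[tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
            for f0, f1 in zip(seq, seq[1:])]


def velocity_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """MPJPE between predicted and target displacements (needs >= 2 frames)."""
    return mpjpe(frame_differences(pred), frame_differences(target))


def joint_loss(pred: Sequence[Frame], target: Sequence[Frame],
               velocity_weight: float = 1.0) -> float:
    """Position term plus a weighted velocity term (weight is an assumption)."""
    return mpjpe(pred, target) + velocity_weight * velocity_loss(pred, target)


# Two frames, two joints, coordinates in millimetres.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
          [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)],
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]

print(mpjpe(pred, target))          # (3 + 4 + 0 + 0) / 4 = 1.75
print(velocity_loss(pred, target))  # (3 + 4) / 2 = 3.5
print(joint_loss(pred, target))     # 1.75 + 1.0 * 3.5 = 5.25
```

In this toy case the prediction is wrong only in the first frame, yet the velocity term is larger than the position term: the target does not move while the prediction jumps between frames, which is the kind of jitter the velocity loss is meant to penalize.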

Keywords

References

Li Yutong, Ma Miao, Chen Jianrui. Leveraging action description generation and cross-modal semantic alignment for skeleton-based action recognition[J]. Acta Electronica Sinica, 2025, 53(11): 4116-4131. (in Chinese)

Zheng J X, Shi X W, Gorban A, et al. Multi-modal 3D human pose estimation with 2D weak supervision in autonomous driving[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2022: 4477-4486. DOI: 10.1109/cvprw56347.2022.00494

Wang K Z, Lin L, Jiang C H, et al. 3D human pose machines with self-supervised learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(5): 1069-1082.

Peng S, Hu J W. 3D human pose estimation in video with temporal and spatial transformer[C]//International Conference on Image, Signal Processing, and Pattern Recognition. SPIE, 2023: 136. DOI: 10.1117/12.2681195

Chen Y L, Wang Z C, Peng Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7103-7112. DOI: 10.1109/cvpr.2018.00742

Newell A, Yang K Y, Deng J. Stacked hourglass networks for human pose estimation[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 483-499. DOI: 10.1007/978-3-319-46484-8_29

Sun K, Xiao B, Liu D, et al. Deep high-resolution representation learning for human pose estimation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 5686-5696. DOI: 10.1109/cvpr.2019.00584

Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[PP/OL]. V4. arXiv (2017-02-22)[2025-09-28]. https://doi.org/10.48550/arXiv.1609.02907

Ci H, Wang C Y, Ma X X, et al. Optimizing network structure for 3D human pose estimation[C]//2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 2262-2271. DOI: 10.1109/iccv.2019.00235

Liu K K, Zou Z M, Tang W. Learning global pose features in graph convolutional networks for 3D human pose estimation[M]//Computer Vision - ACCV 2020. Cham: Springer International Publishing, 2021: 89-105. DOI: 10.1007/978-3-030-69525-5_6

Zhao L, Peng X, Tian Y, et al. Semantic graph convolutional networks for 3D human pose regression[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3420-3430. DOI: 10.1109/cvpr.2019.00354

Hu W B, Zhang C G, Zhan F N, et al. Conditional directed graph convolution for 3D human pose estimation[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 602-611. DOI: 10.1145/3474085.3475219

Smith A C, Brown E N. Estimating a state-space model from point process observations[J]. Neural Computation, 2003, 15(5): 965-991. DOI: 10.1162/089976603765202622

Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces[PP/OL]. V2. arXiv (2024-05-31)[2025-09-28]. https://doi.org/10.48550/arXiv.2312.00752

Dao T, Gu A, Hwang S, et al. Hydra: Bidirectional state space models through generalized matrix mixers[C]//Advances in Neural Information Processing Systems 37. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 110876-110908. DOI: 10.52202/079017-3520

Zhang Z. Group graph convolutional networks for 3D human pose estimation[C]//Proceedings of the British Machine Vision Conference. London: BMVA Press, 2022: 1019.

Zou Z M, Tang W. Modulated graph convolutional network for 3D human pose estimation[C]//2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 11457-11467. DOI: 10.1109/iccv48922.2021.01128

Yu B X B, Zhang Z, Liu Y X, et al. GLA-GCN: Global-local adaptive graph convolutional network for 3D human pose estimation from monocular video[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 8784-8795. DOI: 10.1109/iccv51070.2023.00810

Gu A, Goel K, Ré C. Efficiently modeling long sequences with structured state spaces[PP/OL]. V3. arXiv (2022-08-05)[2025-09-28]. https://doi.org/10.48550/arXiv.2111.00396

Zhu L H, Liao B C, Zhang Q, et al. Vision mamba: Efficient visual representation learning with bidirectional state space model[PP/OL]. V3. arXiv (2024-11-14)[2025-09-28]. https://doi.org/10.48550/arXiv.2401.09417

Jiao J B, Liu Y, Liu Y F, et al. VMamba: Visual state space model[C]//Advances in Neural Information Processing Systems 37. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 103031-103063. DOI: 10.52202/079017-3273

Huang Y L, Liu J S, Xian K, et al. PoseMamba: Monocular 3D human pose estimation with bidirectional global-local spatio-temporal state space model[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2025, 39(4): 3842-3850. DOI: 10.1609/aaai.v39i4.32401

Zhu W T, Ma X X, Liu Z Y, et al. MotionBERT: A unified perspective on learning human motion representations[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 15039-15053. DOI: 10.1109/iccv51070.2023.01385

Loshchilov I, Hutter F. Decoupled weight decay regularization[PP/OL]. V3. arXiv (2019-01-04)[2025-09-28]. https://doi.org/10.48550/arXiv.1711.05101

Zhao Q T, Zheng C, Liu M Y, et al. PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 8877-8886. DOI: 10.1109/cvpr52729.2023.00857

Zhang J L, Tu Z G, Yang J Y, et al. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 13222-13232. DOI: 10.1109/cvpr52688.2022.01288

Kim J H, Han J, Lee S W. PoseAnchor: Robust root position estimation for 3D human pose estimation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2025: 7079-7088. DOI: 10.1109/iccv51701.2025.00665

Peng J H, Zhou Y H, Mok P Y. KTPFormer: Kinematics and trajectory prior knowledge-enhanced transformer for 3D human pose estimation[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 1123-1132. DOI: 10.1109/cvpr52733.2024.00113

Zheng K L, Lu F X, Lv Y H, et al. 3D human pose estimation via non-causal retentive networks[M]//Computer Vision - ECCV 2024. Cham: Springer Nature Switzerland, 2024: 111-128. DOI: 10.1007/978-3-031-73414-4_7

Li W H, Liu H, Tang H, et al. MHFormer: Multi-hypothesis transformer for 3D human pose estimation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 13137-13146. DOI: 10.1109/cvpr52688.2022.01280

Tang Z H, Hao Y B, Li J, et al. FTCM: Frequency-temporal collaborative module for efficient 3D human pose estimation in video[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 911-923. DOI: 10.1109/tcsvt.2023.3286402

Shan W K, Liu Z H, Zhang X F, et al. P-STMO: Pre-trained spatial temporal many-to-one model for 3D human pose estimation[C]//Computer Vision - ECCV 2022. Cham: Springer, 2022: 461-478. DOI: 10.1007/978-3-031-20065-6_27

Chen H Y, He J Y, Xiang W M, et al. HDFormer: High-order directed transformer for 3D human pose estimation[PP/OL]. V2. arXiv (2023-05-22)[2025-09-28]. https://doi.org/10.48550/arXiv.2302.01825

Shan W K, Liu Z H, Zhang X F, et al. Diffusion-based 3D human pose estimation with multi-hypothesis aggregation[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 14715-14725. DOI: 10.1109/iccv51070.2023.01356

Tang Z H, Qiu Z F, Hao Y B, et al. 3D human pose estimation with spatio-temporal criss-cross attention[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 4790-4799. DOI: 10.1109/cvpr52729.2023.00464

Li W H, Liu H, Ding R W, et al. Exploiting temporal contexts with strided transformer for 3D human pose estimation[J]. IEEE Transactions on Multimedia, 2023, 25: 1282-1293. DOI: 10.1109/tmm.2022.3141231

Zhang J L, Chen Y J, Tu Z G. Uncertainty-aware 3D human pose estimation from monocular video[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 5102-5113. DOI: 10.1145/3503161.3547773

Mehraban S, Adeli V, Taati B. MotionAGFormer: Enhancing 3D human pose estimation with a transformer-GCNFormer network[C]//2024 IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2024: 6905-6915. DOI: 10.1109/wacv57701.2024.00677
