

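1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China
2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, Hubei 430065, China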
Received: 28 September 2025
Accepted: 28 February 2026
Published: 25 March 2026
LI Jun, LI Yu, CHEN Li. MSGPose: Monocular Single-Person 3D Human Pose Estimation Via Multi-Semantic Graph Convolution and Graph-Guided State Space Models[J]. Acta Electronica Sinica, 2026, 54(03): 1118-1131. DOI: 10.12263/DZXB.20250860
Monocular single-person 3D human pose estimation is valuable for applications such as action recognition and human-computer interaction. However, the lack of depth cues in monocular setups introduces inherent depth ambiguity, while severe self-occlusion and imaging noise further complicate the task, so robustly recovering 3D poses from erroneous 2D observations remains a major challenge. Existing graph convolutional network (GCN) methods abstract the human skeleton into a predefined physical graph and rely heavily on this single, static topology. Consequently, they fail to fully express non-physical semantic relations such as bilateral symmetry. Self-attention-based methods, in contrast, excel at long-range modeling, yet their quadratic complexity on long sequences leads to redundant computation and a large parameter count. To address these limitations, we propose MSGPose, a dual-stream parallel framework for parameter-efficient monocular 3D human pose estimation. It combines a multi-semantic dynamic separable graph convolution (MSDG) module with a semantic graph-guided Mamba (SGM) module to jointly model and extract spatial and temporal features from 2D pose sequences. For spatial relations, the MSDG module constructs multi-level semantic graphs from three priors: self-connections, physical connections, and left-right anatomical symmetry. It assigns independent learnable weights to each semantic branch to avoid feature coupling, and a weighted modification matrix dynamically mitigates the inductive bias of the fixed topology. For temporal dynamics, MSDG employs a sparse dynamic temporal graph convolution whose graph is built with a K-nearest-neighbors (K-NN) strategy based on feature similarity, which adaptively captures cross-joint and cross-frame dependencies during complex movements. The SGM module addresses the limited spatial-topology modeling of standard Mamba architectures, in which flattening spatial tokens into 1D sequences may disrupt the natural topology of the human skeleton. SGM therefore applies multi-semantic graph convolution guidance before the causal 1D convolution and the bidirectional state space scan. This step explicitly injects anatomical structure priors into the sequence representation and gives the subsequent state space model a feature space with local geometric awareness, enabling efficient modeling of long-range spatio-temporal dependencies with linear complexity. In the feature fusion stage, learnable adaptive weights fuse the complementary outputs of the two streams. The model is trained jointly with a 3D position loss and a velocity loss; the velocity loss constrains differences between adjacent frames, which improves the temporal consistency and stability of the predicted poses. On the Human3.6M dataset, MSGPose achieves a mean per joint position error (MPJPE) of 38.9 mm, 0.3 mm lower than MotionBERT, while using only 13.3 M parameters (approximately 31% of MotionBERT). On the more complex MPI-INF-3DHP dataset, it achieves an MPJPE of 14.5 mm, 1.7 mm lower than MotionAGFormer. With noise-free ground-truth 2D annotations as input, the MPJPE on Human3.6M drops further to 12.7 mm. These results indicate that combining multi-semantic priors with a dual-stream architecture effectively improves 2D-to-3D pose regression.
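The abstract names the evaluation metric (MPJPE) and the two training terms (3D position loss and velocity loss) but does not give their exact form. As a rough illustration only, the sketch below assumes the common definitions: MPJPE as the mean Euclidean distance between predicted and ground-truth joints, and the velocity term as the same distance computed on frame-to-frame displacements. The function names and the weight `lambda_v` are hypothetical and not taken from the paper; root alignment, which benchmark protocols usually apply before computing MPJPE, is omitted.

```python
import math


def mpjpe(pred, target):
    """Mean per joint position error over all frames and joints.

    pred, target: nested lists indexed as [frame][joint] -> (x, y, z),
    both in the same unit (e.g. millimetres).
    """
    dists = [
        math.dist(p, t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(dists) / len(dists)


def velocities(seq):
    """Frame-to-frame displacement of every joint: seq[t + 1] - seq[t]."""
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def velocity_error(pred, target):
    """Mean Euclidean error on displacements; needs at least two frames."""
    return mpjpe(velocities(pred), velocities(target))


def joint_loss(pred, target, lambda_v=1.0):
    """Position term plus weighted velocity term (lambda_v is an assumed weight)."""
    return mpjpe(pred, target) + lambda_v * velocity_error(pred, target)


# Toy sequence: 3 frames, 2 joints, ground truth at the origin.
target = [[(0.0, 0.0, 0.0)] * 2 for _ in range(3)]
pred = [list(frame) for frame in target]
pred[1][0] = (3.0, 4.0, 0.0)  # one joint is 5 mm off in the middle frame

print(mpjpe(pred, target), velocity_error(pred, target))
```

In this toy case a single-frame spike barely moves the position term (5 mm spread over six joint samples) but is counted twice by the velocity term, once entering and once leaving the spike. That is the intuition behind using a velocity term to discourage temporal jitter.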
Li Yutong, Ma Miao, Chen Jianrui. Leveraging action description generation and cross-modal semantic alignment for skeleton-based action recognition[J]. Acta Electronica Sinica, 2025, 53(11): 4116-4131. (in Chinese)
Zheng J X, Shi X W, Gorban A, et al. Multi-modal 3D human pose estimation with 2D weak supervision in autonomous driving[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2022: 4477-4486. DOI: 10.1109/cvprw56347.2022.00494
Wang K Z, Lin L, Jiang C H, et al. 3D human pose machines with self-supervised learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(5): 1069-1082.
Peng S, Hu J W. 3D human pose estimation in video with temporal and spatial transformer[C]//International Conference on Image, Signal Processing, and Pattern Recognition. SPIE, 2023: 136. DOI: 10.1117/12.2681195
Chen Y L, Wang Z C, Peng Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7103-7112. DOI: 10.1109/cvpr.2018.00742
Newell A, Yang K Y, Deng J. Stacked hourglass networks for human pose estimation[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 483-499. DOI: 10.1007/978-3-319-46484-8_29
Sun K, Xiao B, Liu D, et al. Deep high-resolution representation learning for human pose estimation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 5686-5696. DOI: 10.1109/cvpr.2019.00584
Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[PP/OL]. V4. arXiv (2017-02-22)[2025-09-28]. https://doi.org/10.48550/arXiv.1609.02907
Ci H, Wang C Y, Ma X X, et al. Optimizing network structure for 3D human pose estimation[C]//2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 2262-2271. DOI: 10.1109/iccv.2019.00235
Liu K K, Zou Z M, Tang W. Learning global pose features in graph convolutional networks for 3D human pose estimation[M]//Computer Vision - ACCV 2020. Cham: Springer International Publishing, 2021: 89-105. DOI: 10.1007/978-3-030-69525-5_6
Zhao L, Peng X, Tian Y, et al. Semantic graph convolutional networks for 3D human pose regression[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3420-3430. DOI: 10.1109/cvpr.2019.00354
Hu W B, Zhang C G, Zhan F N, et al. Conditional directed graph convolution for 3D human pose estimation[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 602-611. DOI: 10.1145/3474085.3475219
Smith A C, Brown E N. Estimating a state-space model from point process observations[J]. Neural Computation, 2003, 15(5): 965-991. DOI: 10.1162/089976603765202622
Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces[PP/OL]. V2. arXiv (2024-05-31)[2025-09-28]. https://doi.org/10.48550/arXiv.2312.00752
Dao T, Gu A, Hwang S, et al. Hydra: Bidirectional state space models through generalized matrix mixers[C]//Advances in Neural Information Processing Systems 37. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 110876-110908. DOI: 10.52202/079017-3520
Zhang Z. Group graph convolutional networks for 3D human pose estimation[C]//Proceedings of the British Machine Vision Conference. London: BMVA Press, 2022: 1019.
Zou Z M, Tang W. Modulated graph convolutional network for 3D human pose estimation[C]//2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 11457-11467. DOI: 10.1109/iccv48922.2021.01128
Yu B X B, Zhang Z, Liu Y X, et al. GLA-GCN: Global-local adaptive graph convolutional network for 3D human pose estimation from monocular video[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 8784-8795. DOI: 10.1109/iccv51070.2023.00810
Gu A, Goel K, Ré C. Efficiently modeling long sequences with structured state spaces[PP/OL]. V3. arXiv (2022-08-05)[2025-09-28]. https://doi.org/10.48550/arXiv.2111.00396
Zhu L H, Liao B C, Zhang Q, et al. Vision mamba: Efficient visual representation learning with bidirectional state space model[PP/OL]. V3. arXiv (2024-11-14)[2025-09-28]. https://doi.org/10.48550/arXiv.2401.09417
Jiao J B, Liu Y, Liu Y F, et al. VMamba: Visual state space model[C]//Advances in Neural Information Processing Systems 37. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 103031-103063. DOI: 10.52202/079017-3273
Huang Y L, Liu J S, Xian K, et al. PoseMamba: Monocular 3D human pose estimation with bidirectional global-local spatio-temporal state space model[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2025, 39(4): 3842-3850. DOI: 10.1609/aaai.v39i4.32401
Zhu W T, Ma X X, Liu Z Y, et al. MotionBERT: A unified perspective on learning human motion representations[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 15039-15053. DOI: 10.1109/iccv51070.2023.01385
Loshchilov I, Hutter F. Decoupled weight decay regularization[PP/OL]. V3. arXiv (2019-01-04)[2025-09-28]. https://doi.org/10.48550/arXiv.1711.05101
Zhao Q T, Zheng C, Liu M Y, et al. PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 8877-8886. DOI: 10.1109/cvpr52729.2023.00857
Zhang J L, Tu Z G, Yang J Y, et al. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 13222-13232. DOI: 10.1109/cvpr52688.2022.01288
Kim J H, Han J, Lee S W. PoseAnchor: Robust root position estimation for 3D human pose estimation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2025: 7079-7088. DOI: 10.1109/iccv51701.2025.00665
Peng J H, Zhou Y H, Mok P Y. KTPFormer: Kinematics and trajectory prior knowledge-enhanced transformer for 3D human pose estimation[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 1123-1132. DOI: 10.1109/cvpr52733.2024.00113
Zheng K L, Lu F X, Lv Y H, et al. 3D human pose estimation via non-causal retentive networks[M]//Computer Vision - ECCV 2024. Cham: Springer Nature Switzerland, 2024: 111-128. DOI: 10.1007/978-3-031-73414-4_7
Li W H, Liu H, Tang H, et al. MHFormer: Multi-hypothesis transformer for 3D human pose estimation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 13137-13146. DOI: 10.1109/cvpr52688.2022.01280
Tang Z H, Hao Y B, Li J, et al. FTCM: Frequency-temporal collaborative module for efficient 3D human pose estimation in video[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 911-923. DOI: 10.1109/tcsvt.2023.3286402
Shan W K, Liu Z H, Zhang X F, et al. P-STMO: Pre-trained spatial temporal many-to-one model for 3D human pose estimation[C]//Computer Vision - ECCV 2022. Cham: Springer, 2022: 461-478. DOI: 10.1007/978-3-031-20065-6_27
Chen H Y, He J Y, Xiang W M, et al. HDFormer: High-order directed transformer for 3D human pose estimation[PP/OL]. V2. arXiv (2023-05-22)[2025-09-28]. https://doi.org/10.48550/arXiv.2302.01825
Shan W K, Liu Z H, Zhang X F, et al. Diffusion-based 3D human pose estimation with multi-hypothesis aggregation[C]//2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 14715-14725. DOI: 10.1109/iccv51070.2023.01356
Tang Z H, Qiu Z F, Hao Y B, et al. 3D human pose estimation with spatio-temporal criss-cross attention[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 4790-4799. DOI: 10.1109/cvpr52729.2023.00464
Li W H, Liu H, Ding R W, et al. Exploiting temporal contexts with strided transformer for 3D human pose estimation[J]. IEEE Transactions on Multimedia, 2023, 25: 1282-1293. DOI: 10.1109/tmm.2022.3141231
Zhang J L, Chen Y J, Tu Z G. Uncertainty-aware 3D human pose estimation from monocular video[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 5102-5113. DOI: 10.1145/3503161.3547773
Mehraban S, Adeli V, Taati B. MotionAGFormer: Enhancing 3D human pose estimation with a transformer-GCNFormer network[C]//2024 IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2024: 6905-6915. DOI: 10.1109/wacv57701.2024.00677