Hierarchical Text Semantics-Driven Multi-Granularity Human Motion Generation

SHU Xiangbo; LI Chengjian; YIN Zheng; LI Pengpeng; LI Zechao; TANG Jinhui

doi:10.12263/DZXB.20251089

PAPERS | Updated: 2026-06-04

- ACTA ELECTRONICA SINICA Vol. 54, Issue 1, Pages: 451-465(2026)
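- Author affiliations:
  
  1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094
  2. College of Information Science and Technology & College of Artificial Intelligence, Nanjing Forestry University, Nanjing, Jiangsu 210037
- Funding: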
  
  National Natural Science Foundation of China Major Research Instrument Development Program (62427808); National Natural Science Foundation of China Regional Innovation Development Joint Fund (U25A20442); National Natural Science Foundation of China Excellent Young Scientists Fund (62222207); National Science Fund for Distinguished Young Scholars (62425603)
- DOI: 10.12263/DZXB.20251089
  CLC: TP391
- Received: 13 December 2025; Accepted: 06 January 2026; Published: 25 January 2026

SHU Xiangbo, LI Chengjian, YIN Zheng, et al. Hierarchical Text Semantics-Driven Multi-Granularity Human Motion Generation[J]. Acta Electronica Sinica, 2026, 54(01): 451-465. DOI: 10.12263/DZXB.20251089

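Abstract (Chinese version, translated)

Current human motion generation methods still struggle to produce high-quality motions that are consistent with their text descriptions. Although recent methods based on diffusion models, autoregressive models, and multimodal pre-trained models have made progress in motion naturalness and diversity, they remain clearly deficient in understanding complex textual semantics and in modeling fine-grained actions. The main reasons are: (1) without modeling the hierarchical dependencies among sentence components, models have difficulty understanding textual semantics; (2) existing methods align text and motion across modalities only at the global level or only at the word level, overlooking the complementarity between global and local information, which makes joint coarse- and fine-grained modeling difficult. To this end, this paper proposes a Hierarchical Textual-semantic-driven Multi-Granularity human motion generation framework (HTMG), which achieves coarse- and fine-grained cross-modal interaction while comprehensively understanding textual semantics, thereby attaining text-motion consistency. Specifically, to address the difficulty of textual semantic understanding, the paper proposes a Hierarchical Semantic Capture Strategy (HSCS), which builds a text structure tree through syntactic parsing to explicitly model the dependency relations among words, and introduces a Hyperbolic Graph ATtention mechanism (HGAT) to dynamically capture hierarchical semantic dependencies in hyperbolic space, significantly improving the model's semantic understanding. In addition, to achieve coarse- and fine-grained cross-modal alignment, the paper designs a Multi-Granularity Cross-modal Attention mechanism (MGCA), which adaptively cross-fuses the global-level semantic representation and the word-level local semantic representations, each separately, with human motion features, so that the model attends to both the overall motion intent and local motion changes during generation, achieving semantically consistent multi-granularity motion modeling. Extensive experiments show that HTMG achieves the best performance on both the HumanML3D and KIT-ML datasets, validating the framework's effectiveness in textual semantic understanding and text-motion consistency modeling.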
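
To make the hyperbolic part of HSCS concrete, here is a minimal sketch, in plain Python, of attention over a dependency tree in the Poincaré ball. It is not the paper's HGAT: the abstract does not give its formulation. The toy tree, the hand-picked 2-D vectors, the curvature parameter `c`, and the choice of scoring neighbors by negative hyperbolic distance are all assumptions made for illustration.

```python
import math

def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_dist(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the Poincare ball."""
    def sq(u):
        return sum(a * a for a in u)
    diff = sq([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq(x)) * (1.0 - c * sq(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

def hyperbolic_attention(node, neighbors, c=1.0):
    """Softmax over negative hyperbolic distances (an assumed scoring rule)."""
    scores = [-poincare_dist(node, n, c) for n in neighbors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy dependency tree for "a person walks forward": head -> dependents.
# Hand-written for illustration, not the output of a real parser.
tree = {"walks": ["person", "forward"], "person": ["a"]}

# Hypothetical tangent vectors; deeper words are placed farther from the origin.
tangent = {
    "walks": [0.1, 0.0],
    "person": [0.8, 0.3],
    "forward": [0.2, 0.9],
    "a": [1.6, 0.7],
}
points = {word: expmap0(vec) for word, vec in tangent.items()}

# Attention of the root verb over its direct dependents.
weights = hyperbolic_attention(points["walks"], [points[d] for d in tree["walks"]])
print(dict(zip(tree["walks"], weights)))
```

Distances in the ball grow rapidly toward the boundary, which is why tree-like structures such as dependency parses embed with low distortion there. A trained model would learn the embeddings and the attention parameters instead of fixing them by hand.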

Abstract

Generating high-quality human motions that are semantically consistent with textual descriptions remains a challenging problem. Although recent diffusion-based, autoregressive, and multimodal pre-trained approaches have improved motion naturalness and diversity, they still struggle with complex semantic understanding and fine-grained motion modeling. These limitations mainly stem from two factors: (1) the lack of explicit modeling of hierarchical dependency relationships among sentence components, which hampers accurate textual semantic understanding; (2) the reliance on either global-level or word-level text-motion alignment, while neglecting the complementarity between global and local semantics, making coarse-to-fine collaborative modeling difficult. To address these limitations, we propose the hierarchical textual-semantic-driven multi-granularity human motion generation framework (HTMG), which models textual semantics while enabling coarse-to-fine cross-modal interactions to ensure text-motion consistency. Specifically, we introduce a hierarchical semantic capture strategy (HSCS) that constructs a textual structure tree via syntactic parsing and embeds it into hyperbolic space, where hierarchical semantic dependencies are dynamically modeled using a hyperbolic graph attention mechanism. Furthermore, we design a multi-granularity cross-modal attention mechanism (MGCA) that adaptively fuses global-level and word-level semantic representations with motion features, allowing the model to jointly capture overall motion intent and fine-grained action variations. Extensive experiments demonstrate that HTMG achieves state-of-the-art performance on the HumanML3D and KIT-ML benchmarks, validating the effectiveness of our framework in textual semantic understanding and text-motion alignment.
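
The idea behind MGCA, fusing a sentence-level vector and word-level vectors with motion features, can be illustrated with a small plain-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the scalar sigmoid gate `gate_w`, the residual form of the fusion, and the toy feature values are all hypothetical. The abstract only states that the two granularities are fused adaptively.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query over a set of key/value vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def multi_granularity_fuse(motion, words, sentence, gate_w):
    """Blend word-level (fine) and sentence-level (coarse) text context into each motion frame."""
    fused = []
    for frame in motion:
        # Fine-grained branch: the frame attends over the individual word features.
        local = cross_attend(frame, words, words)
        # Coarse branch: one global vector, so attention over it is trivial.
        coarse = sentence
        # Hypothetical per-frame scalar gate in (0, 1) that balances the two branches.
        gate = 1.0 / (1.0 + math.exp(-dot(frame, gate_w)))
        fused.append([f + gate * l + (1.0 - gate) * g
                      for f, l, g in zip(frame, local, coarse)])
    return fused

# Toy inputs: 2 motion frames, 3 word tokens, 3-D features (all values made up).
motion = [[0.2, -0.1, 0.5], [0.0, 0.4, -0.3]]
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sentence = [1.0 / 3, 1.0 / 3, 1.0 / 3]
gate_w = [0.5, -0.5, 0.1]

out = multi_granularity_fuse(motion, words, sentence, gate_w)
for frame in out:
    print([round(x, 3) for x in frame])
```

A per-frame gate lets some frames lean on the sentence-level intent and others on a specific word. A real model would use learned query, key, and value projections, multiple heads, and a learned gate in place of these fixed toy values.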
