Leveraging Action Description Generation and Cross-Modal Semantic Alignment for Skeleton-Based Action Recognition

LI Yu-tong; MA Miao; CHEN Jian-rui

doi:10.12263/DZXB.20250652

PAPERS | Updated: 2026-02-10

- ACTA ELECTRONICA SINICA Vol. 53, Issue 11, Pages: 4116-4131 (2025)
- Author affiliations:
  1. School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710119
  2. Key Laboratory of Modern Teaching Technology, Ministry of Education, Xi'an, Shaanxi 710062
- Funding: National Natural Science Foundation of China (62377031)
- DOI: 10.12263/DZXB.20250652
- CLC: TP391.4
- Received: 25 July 2025; Accepted: 10 November 2025; Published: 25 November 2025
LI Yu-tong, MA Miao, CHEN Jian-rui. Leveraging Action Description Generation and Cross-Modal Semantic Alignment for Skeleton-Based Action Recognition[J]. Acta Electronica Sinica, 2025, 53(11): 4116-4131. DOI：10.12263/DZXB.20250652

Abstract

Action recognition aims to model and analyze human motions to automatically identify and understand human behaviors, and it has been widely applied in fields such as intelligent surveillance, human-computer interaction, and smart education. In recent years, self-supervised skeleton-based action recognition has emerged as an important research area due to its low computational cost, strong adaptability, and minimal reliance on labeled samples. However, existing methods often rely on template-based prompts to generate descriptions of action concepts, which lack spatio-temporal structural information and have limited semantic modeling capability. To address these issues, this paper proposes a cross-modal prior-assisted self-supervised skeleton-based action recognition method that integrates skeletal structural features with semantic priors to obtain more semantically rich action representations. On one hand, it employs a dual-branch decoupled skeleton encoder to separately model the spatial structure and temporal dynamics of actions, and combines it with a cross-domain contrastive learning strategy that establishes feature alignment and consistency constraints from spatial, temporal, and global perspectives, thereby obtaining skeleton-modal features rich in spatio-temporal structure and global context. On the other hand, it feeds temporally concatenated action images along with prompt instructions into a vision-language model (VLM) to generate action descriptions, and uses the text encoder of the contrastive language-image pre-training (CLIP) model to extract text features that carry action semantics, thereby compensating for the limited fine-grained semantic representation of the skeleton modality alone. Furthermore, a skeleton-modulated-text cross-modal contrastive learning strategy is proposed, in which the textual semantics are dynamically modulated under the guidance of skeleton features using a feature-wise linear modulation (FiLM) mechanism, enabling semantic alignment between the skeleton and text modalities. Experimental results show that the recognition accuracy of the proposed method exceeds that of more than ten state-of-the-art approaches, including C²VL, on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, and exceeds that of eight competitive methods, such as ACA-Net, on the PKU-MMD-II dataset. The proposed method integrates skeletal structural information with semantic priors, achieving effective complementarity between skeleton features and language semantics and providing a new perspective for skeleton-based action recognition with low annotation cost. In future work, we will further explore domain-adaptive fine-tuning strategies to enhance the open-set description capability of vision-language models, and develop an online collaborative optimization framework to jointly optimize description generation and action recognition, thereby improving the practicality, intelligence, and interpretability of the proposed method in complex dynamic scenarios such as real-time human-computer interaction and smart education.
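
The skeleton-modulated text alignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the loss form, feature dimensions, or how the FiLM parameters are produced, so the sketch assumes that skeleton and text features already share one dimension, that the FiLM scale and shift come from a single linear layer on the skeleton feature, and that the contrastive objective is an InfoNCE-style loss. All function names and the `params` layout are hypothetical.

```python
import math


def linear(x, weight, bias):
    """Affine map; weight has shape (out_dim, in_dim), x has length in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def film_modulate(text_feat, skel_feat, params):
    """FiLM: scale and shift each text channel with gamma and beta
    predicted from the skeleton feature (assumed: one linear layer each)."""
    gamma = linear(skel_feat, params["w_gamma"], params["b_gamma"])
    beta = linear(skel_feat, params["w_beta"], params["b_beta"])
    return [g * t + b for g, t, b in zip(gamma, text_feat, beta)]


def info_nce(skel_batch, text_batch, params, temperature=0.1):
    """Skeleton-to-text InfoNCE-style loss (assumed form).
    The i-th text is the positive for the i-th skeleton; the rest are negatives."""
    losses = []
    for i, skel in enumerate(skel_batch):
        query = l2_normalize(skel)
        logits = []
        for text in text_batch:
            # Each candidate text is modulated by the query skeleton first.
            key = l2_normalize(film_modulate(text, skel, params))
            logits.append(sum(a * b for a, b in zip(query, key)) / temperature)
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def identity_film_params(dim):
    """Hypothetical initialization: gamma = 1 and beta = 0, so FiLM is a no-op."""
    zeros = [[0.0] * dim for _ in range(dim)]
    return {
        "w_gamma": zeros,
        "b_gamma": [1.0] * dim,
        "w_beta": [row[:] for row in zeros],
        "b_beta": [0.0] * dim,
    }


if __name__ == "__main__":
    params = identity_film_params(3)
    skeletons = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(info_nce(skeletons, skeletons, params))        # matched pairs
    print(info_nce(skeletons, skeletons[::-1], params))  # mismatched pairs
```

With the identity initialization, the loss for matched skeleton-text pairs is lower than for mismatched pairs. In training, the FiLM weights would be learned so that the skeleton feature reshapes the text embedding before comparison.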
