Multimodal Intent Recognition Based on Hierarchical Semantic-Consistency Learning

PENG Jun-jie; LI Zheng-yi; ZHANG Huan-xiang; WANG Lan

doi:10.12263/DZXB.20250009


PAPERS | Updated: 2025-10-16

- Acta Electronica Sinica, Vol. 53, Issue 6, Pages 2007-2021 (2025)
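- Author affiliations:
  
  1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444
  2. School of Innovation and Entrepreneurship Education, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010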
- Funding:
  
  Shanghai Service Industry Development Fund (06162021592)
- DOI: 10.12263/DZXB.20250009
  CLC: TP391.4
- Received: 2025-01-02
  
  Revised: 2025-05-21
  
  Published: 2025-06-25

PENG Jun-jie, LI Zheng-yi, ZHANG Huan-xiang, et al. Multimodal Intent Recognition Based on Hierarchical Semantic-Consistency Learning[J]. Acta Electronica Sinica, 2025, 53(06): 2007-2021. DOI: 10.12263/DZXB.20250009

Abstract

Multimodal intent recognition (MIR) is an important research direction for understanding human intent in the real world. It aims to determine a speaker's intent by fusing information from multiple modalities, including language, vision, and audio. However, existing MIR studies focus primarily on constructing a multimodal semantic environment for the text modality, while the rich semantic information in the visual and audio modalities, such as action and emotional semantics, remains insufficiently exploited. Although the visual and audio modalities carry intent-related semantics, their inherent redundancy and noise hinder the effective use of their features. To address these problems, this paper proposes an MIR model that makes better use of audio and visual semantics while suppressing redundant information. The model constructs primary semantic features that suppress redundant information and uses them to guide the learning of intra-modality and inter-modality semantic associations at different scales, in order to understand the speaker's intent. On this basis, the model exploits the latent intent consistency across modalities: it pairs the extracted audio and visual semantic features with textual features, which carry more explicit intent semantics, to filter out irrelevant semantics that the intent recognition task alone cannot eliminate. Furthermore, the model uses a multimodal fusion gating mechanism to integrate intent semantics from the different modalities. Experiments on several intent understanding datasets show that the proposed method effectively extracts audio and visual semantics and filters out semantics irrelevant to intent recognition, and that it outperforms existing MIR methods, achieving improvements of 0.7 to 1.8 percentage points in accuracy (ACC), precision (P), recall (R), and F1 score (F1).
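The reported gains are stated in percentage points of four standard classification metrics. As a rough illustration only, and not the authors' evaluation code, the sketch below computes accuracy and macro-averaged precision, recall, and F1 for a toy set of intent labels. Macro averaging, the function names `macro_metrics` and `percentage_point_gain`, and the example labels are all assumptions made for this sketch; the abstract does not say which averaging scheme the paper uses.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1) in [0, 1].

    Macro averaging is an assumption of this sketch: each intent class
    contributes equally, regardless of how many samples it has.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # samples per gold class
    pred_count = Counter(y_pred)   # samples per predicted class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        # A class with no predictions (or no gold samples) scores 0.0 here.
        p = tp[label] / pred_count[label] if pred_count[label] else 0.0
        r = tp[label] / true_count[label] if true_count[label] else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def percentage_point_gain(new_score, old_score):
    """Difference between two scores in [0, 1], expressed in percentage points."""
    return (new_score - old_score) * 100.0


if __name__ == "__main__":
    # Hypothetical intent labels, used only to exercise the functions.
    gold = ["complain", "praise", "praise", "ask"]
    pred = ["complain", "praise", "ask", "ask"]
    acc, p, r, f1 = macro_metrics(gold, pred)
    print(f"ACC={acc:.4f} P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Under this reading, an improvement of "0.7 to 1.8 percentage points" means that `percentage_point_gain` applied to the proposed model's score and a baseline's score falls between 0.7 and 1.8 for each metric. It is an absolute difference, not a relative one.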


