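Fusing Large Language Models with Cross-Domain Graphs for Multimodal Emotion Recognition in Dialogue

HUANG Chen; MA Haobo; ZHANG Yan; YANG Chao; SONG Jianhua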

doi:10.12263/DZXB.20250772

Research article | Updated: 2026-06-04

- Fusing Large Language Models with Cross-Domain Graphs for Multimodal Emotion Recognition in Dialogue
- Acta Electronica Sinica, 2026, Vol. 54, No. 1, pp. 340-351
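- Author affiliations:
  1. School of Computer Science, Hubei University, Wuhan, Hubei 430062
  2. Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Wuhan, Hubei 430062
  3. Hubei Key Laboratory of Big Data Intelligent Analysis and Industry Application, Wuhan, Hubei 430062
  4. Research Center for Performance Evaluation and Information Management, Key Research Base of Humanities and Social Sciences in Hubei Universities, Wuhan, Hubei 430062
  5. School of Cyberspace Security, Hubei University, Wuhan, Hubei 430062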
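- Author biographies:
  HUANG Chen, male, born in August 1983 in Longyan, Fujian. Professor at the School of Computer Science, Hubei University. Research interests: artificial intelligence and brain-computer interfaces. E-mail: huang@hubu.edu.cn
  MA Haobo, male, born in May 2000 in Qinhuangdao, Hebei. Master's student at the School of Computer Science, Hubei University. Research interests: artificial intelligence, brain science, and sentiment analysis. E-mail: 202321116012629@stu.hubu.edu.cn
  ZHANG Yan, male, born in June 1974 in Yichang, Hubei. Professor at the School of Computer Science, Hubei University. Research interests: information security and big data analysis. Chinese Institute of Electronics member No. E190197582M. E-mail: zhangyan@hubu.edu.cn
  YANG Chao, male, born in September 1982 in Wuhan, Hubei. Professor at the School of Computer Science, Hubei University. Research interests: intelligent computing and information security. E-mail: stevenyc@hubu.edu.cn
  SONG Jianhua, female, born in March 1973 in Xiangyang, Hubei. Professor at the School of Cyberspace Security, Hubei University. Research interests: network and information security. E-mail: sjhhubu@126.com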
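- Funding:
  Hubei Provincial Major Research Project (2023BAA018); Major Science and Technology Special Project of the Hubei Provincial Science and Technology Program (2024BAA008)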
- DOI: 10.12263/DZXB.20250772
  CLC number: TP391; TP399
- Received: 2025-09-05; accepted: 2026-01-08; published in print: 2026-01-25

HUANG Chen, MA Haobo, ZHANG Yan, et al. Fusing Large Language Models with Cross-Domain Graphs for Multimodal Emotion Recognition in Dialogue[J]. Acta Electronica Sinica, 2026, 54(01): 340-351. DOI: 10.12263/DZXB.20250772

Abstract

Multimodal emotion recognition in conversation (MERC) identifies emotional states in conversations by integrating multiple modalities such as text, speech, and visual information. With the rapid development of conversational AI and affective computing, MERC has become a research hotspot in affective computing and human-computer interaction. Compared with traditional unimodal emotion recognition, multimodal approaches capture the multifaceted characteristics of emotions more comprehensively and accurately. For instance, text conveys explicit emotional content; speech provides subtle emotional cues such as tone, speed, and intonation; and visual information (such as facial expressions) reflects non-verbal emotional expression. These signals complement each other, enhancing the accuracy and robustness of emotion recognition. However, multimodal emotion recognition faces several challenges. First, information is represented very differently across modalities, and traditional methods such as feature concatenation or weighted averaging fail to fully capture the complex interactions between modalities, which can lead to information loss. Second, emotion recognition tasks often suffer from local noise and outlier samples, which can degrade model stability. Lastly, the accuracy of emotion recognition depends closely on the effective use of conversational context, since emotions are often influenced by preceding and succeeding utterances; effectively extracting and utilizing contextual information is therefore a major challenge. To address these issues, this paper proposes LLM-EmoGraph, an emotion recognition method that combines a large language model (LLM) with global-local cross-domain graph structures to achieve precise fusion and efficient modeling of multimodal data. The method introduces a multimodal masking mechanism to handle missing and inconsistent information across modalities, so that the model maintains good performance even with incomplete or low-quality information. Through large-scale cross-domain multi-graph pretraining, LLM-EmoGraph enhances the model's transferability between modalities and graph structures, further improving its robustness. An adaptive dual-scale feature fusion strategy efficiently aligns textual, speech, and visual semantic features, improving emotion recognition accuracy, particularly in scenarios involving high interaction among modalities. Additionally, the paper designs a weakly supervised hierarchical emotion classification scheme based on the LLM. This scheme guides the extraction of emotional information layer by layer, effectively preventing interference from global emotional patterns and allowing the model to learn emotional features accurately even with limited annotated data. Experimental results show that LLM-EmoGraph significantly outperforms existing mainstream methods on multiple benchmark datasets, demonstrating its effectiveness in multimodal emotion recognition tasks. In summary, through its multimodal fusion strategy, large-scale pretraining, and weakly supervised learning, LLM-EmoGraph addresses a series of challenges in multimodal emotion recognition and offers strong support for improving the accuracy and stability of emotion recognition systems.
