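1. School of Computer Science, Hubei University, Wuhan, Hubei 430062
2. Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Wuhan, Hubei 430062
3. Hubei Provincial Key Laboratory of Big Data Intelligent Analysis and Industry Applications, Wuhan, Hubei 430062
4. Performance Evaluation Information Management Research Center, Key Research Base of Humanities and Social Sciences of Hubei Provincial Universities, Wuhan, Hubei 430062
5. School of Cyberspace Security, Hubei University, Wuhan, Hubei 430062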
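HUANG Chen, male, born in August 1983 in Longyan, Fujian. He is currently a professor at the School of Computer Science, Hubei University. His main research interests are artificial intelligence and brain-computer interfaces. E-mail: huang@hubu.edu.cn
MA Haobo, male, born in May 2000 in Qinhuangdao, Hebei. He is currently a master's student at the School of Computer Science, Hubei University. His main research interests are artificial intelligence, brain science, and sentiment analysis. E-mail: 202321116012629@stu.hubu.edu.cn
ZHANG Yan, male, born in June 1974 in Yichang, Hubei. He is currently a professor at the School of Computer Science, Hubei University. His main research interests are information security and big data analysis. Chinese Institute of Electronics membership number: E190197582M. E-mail: zhangyan@hubu.edu.cn
YANG Chao, male, born in September 1982 in Wuhan, Hubei. He is currently a professor at the School of Computer Science, Hubei University. His main research interests include intelligent computing and information security. E-mail: stevenyc@hubu.edu.cn
SONG Jianhua, female, born in March 1973 in Xiangyang, Hubei. She is currently a professor at the School of Cyberspace Security, Hubei University. Her main research interest is network and information security. E-mail: sjhhubu@126.com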
Received: 2025-09-05
Accepted: 2026-01-08
Published in print: 2026-01-25
HUANG Chen, MA Haobo, ZHANG Yan, et al. Fusing Large Language Models with Cross-Domain Graphs for Multimodal Emotion Recognition in Dialogue[J]. Acta Electronica Sinica, 2026, 54(01): 340-351. DOI:10.12263/DZXB.20250772
Multimodal emotion recognition in conversation (MERC) identifies the emotional states expressed in a dialogue by integrating text, speech, and visual information. With the rapid development of conversational AI and affective computing, MERC has become a research hotspot in affective computing and human-computer interaction. Compared with traditional unimodal emotion recognition, multimodal approaches capture the multifaceted nature of emotion more comprehensively and accurately: text conveys explicit emotional content, speech provides implicit cues such as tone, speaking rate, and intonation, and visual information (such as facial expressions) reflects non-verbal emotional expression. These signals complement one another and improve the accuracy and robustness of emotion recognition. However, MERC faces several challenges. First, modalities differ significantly in how they represent information, and traditional methods such as feature concatenation or weighted averaging cannot fully capture the complex interactions among them, which can lead to information loss. Second, emotion recognition is often disturbed by local noise and outlier samples, which degrades model stability. Third, accuracy depends on how well the conversational context is used, because an emotion is often shaped by the preceding and following utterances; extracting and exploiting that context effectively is therefore a major challenge. To address these issues, this paper proposes LLM-EmoGraph, a method that combines a large language model (LLM) with global-local cross-domain graph structures to achieve precise fusion and efficient modeling of multimodal data. The method introduces a multimodal masking mechanism to handle missing and inconsistent information across modalities, so that the model maintains good performance even when information is incomplete or of low quality. Large-scale cross-domain multi-graph pretraining improves transferability across modalities and across graph structures, further strengthening robustness. An adaptive dual-scale feature fusion strategy aligns textual, speech, and visual features efficiently and improves recognition accuracy, particularly when the modalities interact strongly. In addition, a weakly supervised hierarchical emotion classification scheme based on the LLM guides the extraction of emotional information layer by layer, which effectively prevents interference from global emotional patterns and lets the model learn emotional features accurately even with limited annotated data. Experimental results show that LLM-EmoGraph significantly outperforms existing mainstream methods on multiple benchmark datasets, demonstrating its effectiveness for multimodal emotion recognition. Overall, through its multimodal fusion strategy, large-scale pretraining, and weakly supervised learning, LLM-EmoGraph addresses a series of challenges in multimodal emotion recognition and supports more accurate and stable emotion recognition systems.
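The abstract does not describe how the multimodal masking mechanism is implemented. Purely as an illustration of the general idea of fusing modalities while tolerating a missing one, the following minimal sketch computes a weighted average over only the modalities that are present and renormalizes the weights. The function name `masked_fusion`, the use of `None` to mark a missing modality, and the fixed weights are all assumptions made for this example; it is not the LLM-EmoGraph method.

```python
from typing import Dict, List, Optional

Vector = List[float]


def masked_fusion(
    features: Dict[str, Optional[Vector]],
    weights: Dict[str, float],
) -> Vector:
    """Weighted average over the modalities that are present.

    Hypothetical helper: a modality whose feature vector is None is
    masked out, and the remaining weights are renormalized to sum to 1.
    """
    # Keep only the modalities that actually supplied a feature vector.
    present = {name: vec for name, vec in features.items() if vec is not None}
    if not present:
        raise ValueError("at least one modality must be present")

    dim = len(next(iter(present.values())))
    if any(len(vec) != dim for vec in present.values()):
        raise ValueError("all present modalities must share one dimension")

    total = sum(weights[name] for name in present)
    if total <= 0:
        raise ValueError("weights of present modalities must sum to a positive value")

    fused = [0.0] * dim
    for name, vec in present.items():
        w = weights[name] / total  # renormalized weight
        for i, value in enumerate(vec):
            fused[i] += w * value
    return fused


# Example: the audio modality is missing, so only text and visual contribute.
example = masked_fusion(
    {"text": [1.0, 0.0], "audio": None, "visual": [0.0, 1.0]},
    {"text": 2.0, "audio": 1.0, "visual": 2.0},
)
print(example)
```

A plain weighted average like this is exactly the kind of baseline the abstract says cannot capture complex cross-modal interactions; the sketch only shows how a mask keeps such a fusion step well defined when one input is absent.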