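Attention Penalty and Adaptive Learning Scene Graph for Joint Multimodal Aspect-Based Sentiment Analysis

HUANG Chen; LIU Huijie; ZHANG Yan; YANG Chao; SONG Jianhua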

doi:10.12263/DZXB.20250543


Research article | Updated: 2026-05-28

- Attention Penalty and Adaptive Learning Scene Graph for Joint Multimodal Aspect-Based Sentiment Analysis
- Acta Electronica Sinica, 2026, Vol. 54, No. 2, pp. 851-861
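- Author affiliations:
  
  1. School of Computer Science, Hubei University, Wuhan, Hubei 430062, China
  2. Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Wuhan, Hubei 430062, China
  3. Hubei Provincial Key Laboratory of Big Data Intelligent Analysis and Industry Applications, Wuhan, Hubei 430062, China
  4. Research Center for Performance Evaluation and Information Management, Key Research Base of Humanities and Social Sciences of Hubei Provincial Universities, Wuhan, Hubei 430062, China
  5. School of Cyberspace Security, Hubei University, Wuhan, Hubei 430062, China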
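- About the authors:
  
  HUANG Chen, male, born in August 1983 in Longyan, Fujian. Professor at the School of Computer Science, Hubei University. Research interests: artificial intelligence and brain-computer interfaces. E-mail: huang@hubu.edu.cn
  LIU Huijie, male, born in February 2000 in Huangshi, Hubei. Master's student at the School of Computer Science, Hubei University. Research interests: artificial intelligence and brain science, and sentiment analysis. E-mail: liuhj@stu.hubu.edu.cn
  ZHANG Yan, male, born in June 1974 in Yichang, Hubei. Professor at the School of Computer Science, Hubei University. Research interests: information security and big data analysis. Chinese Institute of Electronics member No. E190197582M. E-mail: zhangyan@hubu.edu.cn
  YANG Chao, male, born in September 1982 in Wuhan, Hubei. Professor at the School of Computer Science, Hubei University. Research interests: intelligent computing and information security. E-mail: stevenyc@hubu.edu.cn
  SONG Jianhua, female, born in March 1973 in Xiangyang, Hubei. Professor at the School of Cyberspace Security, Hubei University. Research interests: network and information security. E-mail: sjhhubu@126.com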
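- Funding:
  
  Hubei Provincial Major Research Project (JD) (2023BAA018); Major Science and Technology Special Project of the Hubei Provincial Science and Technology Program (2024BAA008); Wuhan Knowledge Innovation Program (202311901251001)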
- DOI: 10.12263/DZXB.20250543
  CLC number: TP391; TP399
- Received: 2025-06-23; accepted: 2026-01-28; print publication: 2026-02-25
HUANG Chen, LIU Huijie, ZHANG Yan, et al. Attention Penalty and Adaptive Learning Scene Graph for Joint Multimodal Aspect-Based Sentiment Analysis[J]. Acta Electronica Sinica, 2026, 54(02): 851-861. DOI: 10.12263/DZXB.20250543

Abstract

Joint multimodal aspect-based sentiment analysis (JMASA), a crucial research direction in fine-grained sentiment analysis, aims to jointly identify specific aspect terms and their corresponding sentiment polarities from image-text pairs, and has garnered increasing attention in recent years. Although this task holds significant application value in areas such as social media analysis and product review mining, existing methods face two main challenges. First, when leveraging pre-trained language models to fuse multimodal information, models often place excessive trust in certain irrelevant visual or textual tokens, assigning them unreasonably high attention scores, which interferes with capturing key sentiment cues. Second, existing methods struggle to explicitly model the complex relationships between objects within an image and lack an effective mechanism for mining deep object-level semantic interactions and dependencies between images and text. To address these issues, this study proposes a scene graph-enhanced method for JMASA based on attention penalty and adaptive learning, named Attention Penalty and Adaptive Learning Scene Graph (APALSG), which uses scene graph generation (SGG) to enhance the analysis. Specifically, a purpose-built attention penalty strategy attenuates attention scores that exceed a predefined threshold and redistributes the attenuated attention to neighboring tokens within their context window. This strategy dynamically adjusts the model's attention distribution, mitigating over-focus on irrelevant information and thereby extracting more precise key object features. Furthermore, a scene graph modeling module applies a graph convolutional network (GCN) to perform message propagation and aggregation on the scene graph, yielding visual representations enriched with contextual information about inter-object relationships. Finally, an adaptive learning strategy enables the model to focus adaptively on the latent dependencies within the image-text pair that are relevant to the current aspect, achieving deep cross-modal alignment and fusion. The fused multimodal features are then fed into a classifier that jointly performs aspect term extraction and sentiment classification. Experiments on multiple publicly available benchmark datasets show that APALSG significantly outperforms existing state-of-the-art methods. Compared with existing JMASA models, APALSG improves precision by 1.46%, 2.18%, and 1.19% on the Twitter-2015, Twitter-2017, and MACSA datasets, respectively.
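
The attention penalty step described in the abstract (attenuate scores above a threshold, then hand the removed attention to neighboring tokens in a context window) can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration only: the function name `penalize_attention`, the parameters `threshold`, `decay`, and `window`, the linear attenuation of the excess over the threshold, and the even split among neighbors are all hypothetical and are not taken from the paper.

```python
from typing import List


def penalize_attention(weights: List[float], threshold: float = 0.5,
                       decay: float = 0.5, window: int = 1) -> List[float]:
    """Attenuate over-threshold attention weights and redistribute the removed mass.

    Hypothetical rule (not the paper's formula): for each weight above
    `threshold`, a fraction `decay` of the excess is removed and split evenly
    among the tokens within `window` positions on either side.
    The total attention mass is preserved.
    """
    n = len(weights)
    adjusted = list(weights)
    # Decisions are based on the original weights so that one redistribution
    # cannot push a neighbor over the threshold within the same pass.
    for i, w in enumerate(weights):
        if w <= threshold:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = [j for j in range(lo, hi) if j != i]
        if not neighbours:
            continue  # a single-token sequence has nowhere to move the mass
        removed = (w - threshold) * decay
        adjusted[i] -= removed
        share = removed / len(neighbours)
        for j in neighbours:
            adjusted[j] += share
    return adjusted


if __name__ == "__main__":
    # One token dominates the distribution; its two neighbors absorb the penalty.
    print(penalize_attention([0.05, 0.8, 0.1, 0.05]))
```

With these illustrative settings, the dominant weight of 0.8 falls to about 0.65 and its two neighbors rise to about 0.125 and 0.175, while the weights still sum to 1. Preserving the total mass is a design choice of this sketch that keeps the result a valid attention distribution without a second normalization step.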


