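1. School of Computer Science, Hubei University, Wuhan, Hubei 430062
2. Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Wuhan, Hubei 430062
3. Hubei Provincial Key Laboratory of Big Data Intelligent Analysis and Industry Applications, Wuhan, Hubei 430062
4. Research Center for Performance Evaluation Information Management, Key Research Base of Humanities and Social Sciences of Hubei Provincial Universities, Wuhan, Hubei 430062
5. School of Cyberspace Security, Hubei University, Wuhan, Hubei 430062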
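HUANG Chen (黄辰), male, born in August 1983 in Longyan, Fujian. He is a professor at the School of Computer Science, Hubei University. His main research interests are artificial intelligence and brain-computer interfaces. E-mail: huang@hubu.edu.cn
LIU Huijie (刘会杰), male, born in February 2000 in Huangshi, Hubei. He is a master's student at the School of Computer Science, Hubei University. His main research interests are artificial intelligence and brain science, and sentiment analysis. E-mail: liuhj@stu.hubu.edu.cn
ZHANG Yan (张龑), male, born in June 1974 in Yichang, Hubei. He is a professor at the School of Computer Science, Hubei University. His main research interests are information security and big data analysis. Member of the Chinese Institute of Electronics, No. E190197582M. E-mail: zhangyan@hubu.edu.cn
YANG Chao (杨超), male, born in September 1982 in Wuhan, Hubei. He is a professor at the School of Computer Science, Hubei University. His main research interests include intelligent computing and information security. E-mail: stevenyc@hubu.edu.cn
SONG Jianhua (宋建华), female, born in March 1973 in Xiangyang, Hubei. She is a professor at the School of Cyberspace Security, Hubei University. Her main research interest is network and information security. E-mail: sjhhubu@126.com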
Received: 2025-06-23
Accepted: 2026-01-28
Published in print: 2026-02-25
HUANG Chen, LIU Huijie, ZHANG Yan, et al. Attention Penalty and Adaptive Learning Scene Graph for Joint Multimodal Aspect-Based Sentiment Analysis[J]. Acta Electronica Sinica, 2026, 54(02): 851-861. DOI:10.12263/DZXB.20250543
Joint multimodal aspect-based sentiment analysis (JMASA), an important research direction in fine-grained sentiment analysis, aims to jointly identify aspect terms and their corresponding sentiment polarities from image-text pairs, and has attracted increasing attention in recent years. Although the task has significant application value in areas such as social media analysis and product review mining, existing methods face two main challenges. First, when pre-trained language models are used to fuse multimodal information, they often over-trust certain irrelevant visual or textual tokens, assigning them unreasonably high attention scores, which interferes with capturing key sentiment cues. Second, existing methods struggle to explicitly model the complex relationships among objects within an image, and lack an effective mechanism for mining deep object-level semantic interactions and dependencies between image and text. To address these issues, this paper proposes APALSG (Attention Penalty and Adaptive Learning Scene Graph), a scene graph-enhanced JMASA method based on attention penalty and adaptive learning, which uses scene graph generation (SGG) to enhance the analysis. Specifically, a dedicated attention penalty strategy attenuates attention scores that exceed a predefined threshold and redistributes the attenuated attention to neighboring tokens within the context window. This strategy dynamically adjusts the model's attention distribution, mitigating over-focus on irrelevant information and thereby extracting more precise key object features. Furthermore, a scene graph modeling module applies a graph convolutional network (GCN) to perform message propagation and aggregation on the scene graph, yielding visual representations enriched with contextual information about inter-object relationships. Finally, an adaptive learning strategy enables the model to adaptively focus on the latent dependencies between image and text that are relevant to the current aspect, achieving deep cross-modal alignment and fusion; the fused multimodal features are then fed into a classifier that jointly predicts aspect terms and sentiment polarities. Experiments on multiple publicly available benchmark datasets show that APALSG significantly outperforms existing state-of-the-art methods. Compared with existing JMASA models, APALSG improves precision by 1.46%, 2.18%, and 1.19% on the Twitter-2015, Twitter-2017, and MACSA datasets, respectively.
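The abstract describes the attention penalty only in words and gives no formula, so the following is a minimal sketch of the general idea rather than the APALSG implementation. The function name `penalize_attention` and the parameters `threshold`, `decay`, and `window` are illustrative assumptions, as is the choice to shrink only the part of a score above the threshold and to split the removed mass equally among neighbors.

```python
from typing import List


def penalize_attention(
    scores: List[float],
    threshold: float = 0.5,
    decay: float = 0.5,
    window: int = 1,
) -> List[float]:
    """Attenuate over-threshold attention scores and give the removed mass to neighbors.

    Hypothetical sketch: names, defaults, and the exact decay rule are assumptions.
    """
    n = len(scores)
    adjusted = list(scores)
    for i, score in enumerate(scores):
        if score <= threshold:
            continue  # only scores above the threshold are penalized
        # Shrink only the excess above the threshold; `decay` is the fraction kept.
        removed = (score - threshold) * (1.0 - decay)
        # Tokens inside the context window around position i, excluding i itself.
        neighbors = [
            j
            for j in range(max(0, i - window), min(n, i + window + 1))
            if j != i
        ]
        if not neighbors:
            continue  # single-token sequence: nothing to redistribute to
        adjusted[i] -= removed
        share = removed / len(neighbors)
        for j in neighbors:
            adjusted[j] += share  # equal split keeps the total attention mass unchanged
    return adjusted


if __name__ == "__main__":
    row = [0.05, 0.80, 0.10, 0.05]  # one attention row that sums to 1
    print(penalize_attention(row, threshold=0.5, decay=0.5, window=1))
```

With these assumed settings, the 0.80 score drops to about 0.65 and its two neighbors rise to about 0.125 and 0.175, while the row still sums to 1. Penalty decisions are made on the original scores, so a neighbor pushed above the threshold by redistribution is not penalized again in this sketch.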