CLGLF: Confidence Learning Guides Label Fusion for Multimodal Named Entity Recognition Method

WANG Hai-rong; WANG Tong; XU Xi; JING Bo-xiang; CHEN Fang-ping

doi:10.12263/DZXB.20231160


PAPERS | Updated: 2025-12-24

- ACTA ELECTRONICA SINICA Vol. 52, Issue 7, Pages: 2429-2437(2024)
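- Author affiliations:
  1. School of Computer Science and Engineering, North Minzu University, Yinchuan, Ningxia 750021
  2. Key Laboratory of Images and Graphics Intelligent Processing of the State Ethnic Affairs Commission, North Minzu University, Yinchuan, Ningxia 750021
- Funding:
  Natural Science Foundation of Ningxia Province (2023AAC03316); Graduate Innovation Program of North Minzu University (YCX23159)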
- DOI: 10.12263/DZXB.20231160
- CLC: TP391
- Received: 15 December 2023; Revised: 1 April 2024; Published: 25 July 2024
WANG Hai-rong, WANG Tong, XU Xi, et al. CLGLF: Confidence Learning Guides Label Fusion for Multimodal Named Entity Recognition Method[J]. Acta Electronica Sinica, 2024, 52(07): 2429-2437.

Abstract

To address visual semantic understanding bias and multimodal semantic bias in multimodal named entity recognition, a confidence-learning-guided label fusion (CLGLF) method for multimodal named entity recognition is proposed. The method invokes the BLIP-2 pre-trained model to generate image captions, concatenates them with the input text, and performs joint image-text encoding to achieve multimodal feature fusion. Candidate labels and text labels are obtained by decoding the multimodal representations and the text representations, respectively. After the two sets of labels are aligned with a KL divergence loss, a confidence score is calculated to evaluate the quality of the multimodal representation, and a confidence threshold is set to help screen out biased candidate labels. The text labels at the corresponding positions then replace the biased candidate labels, which achieves label fusion and completes multimodal named entity recognition. To verify the proposed method, experiments are carried out on the Twitter-2015 and Twitter-2017 multimodal datasets, and the results are compared with 7 mainstream methods such as MSB and UMT. The experimental results demonstrate the effectiveness of CLGLF.
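The label fusion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the confidence score is computed, so the sketch assumes it is the highest probability the multimodal branch assigns to any label for a token. The function names `kl_divergence` and `fuse_labels`, the threshold value 0.5, and the toy probabilities are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same label set."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def fuse_labels(mm_probs, text_probs, labels, threshold=0.5):
    """Keep a candidate label when its confidence reaches the threshold,
    otherwise fall back to the text label at the same position."""
    fused = []
    for mm, txt in zip(mm_probs, text_probs):
        cand_idx = max(range(len(mm)), key=mm.__getitem__)
        text_idx = max(range(len(txt)), key=txt.__getitem__)
        # Assumption: confidence = max probability of the multimodal branch.
        confidence = mm[cand_idx]
        if confidence >= threshold:
            fused.append(labels[cand_idx])
        else:
            fused.append(labels[text_idx])
    return fused


# Toy example: two tokens, three labels.
labels = ["O", "B-PER", "B-LOC"]
mm_probs = [[0.1, 0.8, 0.1], [0.4, 0.35, 0.25]]   # multimodal (candidate) branch
text_probs = [[0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]   # text-only branch

fused = fuse_labels(mm_probs, text_probs, labels, threshold=0.5)

# Per-token KL term of the kind an alignment loss could average over.
alignment = [kl_divergence(t, m) for t, m in zip(text_probs, mm_probs)]
```

In this toy input the first token keeps its candidate label because the multimodal branch is confident, while the second token falls below the threshold and takes the text label instead. The per-token KL values show where the two branches disagree most, which is the quantity an alignment loss would penalize during training.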
