Research on Image Captioning Based on Double Attention Model

ZHUO Ya-qi; WEI Jia-hui; LI Zhi-xin

doi:10.12263/DZXB.20210696


Research article | Updated: 2025-12-08

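- Title: Research on Image Captioning Based on Double Attention Model (基于双注意模型的图像描述生成方法研究)
- Journal: Acta Electronica Sinica (电子学报), 2022, Vol. 50, No. 5, pp. 1123-1130
- Author affiliations:
  
  1. College of Science, Guilin University of Technology, Guilin, Guangxi 541004
  2. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004
- Author biographies:
  
  ZHUO Ya-qi: female, born in June 1976 in Xianyang, Shaanxi. Lecturer at the College of Science, Guilin University of Technology. Research interests: image understanding and machine learning. E-mail: zhuoyaqi@126.com
  WEI Jia-hui: male, born in November 1993 in Dongying, Shandong. Ph.D. candidate at the School of Computer Science and Engineering, Guangxi Normal University. Research interests: image understanding and machine learning. E-mail: weijh@stu.gxnu.edu.cn
  LI Zhi-xin: male, born in October 1971 in Guilin, Guangxi. Professor and doctoral supervisor at the School of Computer Science and Engineering, Guangxi Normal University. Research areas: image understanding, machine learning, and cross-media computing. E-mail: lizx@gxnu.edu.cn
- Funding:
  
  National Natural Science Foundation of China (61966004; 61866004); Guangxi Natural Science Foundation (2019GXNSFDA245018); Innovation Project of Guangxi Graduate Education (XYCBZ2021002)
- DOI: 10.12263/DZXB.20210696
  CLC number: TP391
- Received: 2021-05-31
  
  Revised: 2021-09-22
  
  Published in print: 2022-05-25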
- Citation:
  
  ZHUO Ya-qi, WEI Jia-hui, LI Zhi-xin. Research on Image Captioning Based on Double Attention Model[J]. Acta Electronica Sinica, 2022, 50(05): 1123-1130. DOI: 10.12263/DZXB.20210696.

Abstract

Attention models in existing image captioning approaches usually adopt word-level attention, which extracts local features from the image as the visual input for generating the current word and therefore lacks accurate guidance from global image information. To address this problem, this paper proposes an image captioning approach based on sentence-level attention. The approach employs a self-attention mechanism to extract sentence-level attention information from the image, which represents the global image information needed to generate the sentence. On this basis, we further propose a double attention model that combines sentence-level attention with word-level attention to generate more accurate descriptions. Supervision and optimization are applied at the intermediate stage of the model to address interference between the different kinds of information. In addition, reinforcement learning is applied in a two-stage training procedure to optimize the model's evaluation metrics. Experiments on two benchmark datasets, MSCOCO and Flickr30K, show that the proposed approach generates more accurate and richer captions and outperforms many existing attention-based image captioning approaches on various evaluation metrics.
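The abstract gives no equations, so the following is only an illustrative sketch of the general idea of pairing a global (sentence-level) vector with a per-word (word-level) context. It is not the authors' architecture. The identity query/key/value projections, the mean pooling, the dot-product scoring, the concatenation used for fusion, and every function name below are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Convex combination of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


def self_attention(regions):
    """Scaled dot-product self-attention over region features.

    Assumption: queries, keys and values are the raw features
    (no learned projections), purely to keep the sketch small.
    """
    scale = math.sqrt(len(regions[0]))
    attended = []
    for query in regions:
        weights = softmax([dot(query, key) / scale for key in regions])
        attended.append(weighted_sum(weights, regions))
    return attended


def sentence_level_vector(regions):
    """Assumed pooling: average the self-attended features into one global vector."""
    attended = self_attention(regions)
    uniform = [1.0 / len(attended)] * len(attended)
    return weighted_sum(uniform, attended)


def word_level_context(regions, hidden):
    """Word-level attention: weight regions by similarity to a decoder state."""
    weights = softmax([dot(hidden, region) for region in regions])
    return weighted_sum(weights, regions)


def double_attention_input(regions, hidden):
    """Assumed fusion: concatenate the global vector and the local context."""
    return sentence_level_vector(regions) + word_level_context(regions, hidden)
```

In this sketch the sentence-level vector depends only on the image and could be computed once per caption, while the word-level context would be recomputed at every decoding step from the current hidden state.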
