电子学报 ›› 2022, Vol. 50 ›› Issue (5): 1123-1130.DOI: 10.12263/DZXB.20210696

• Research Article •

Research on Image Captioning Based on a Double Attention Model

卓亚琦1, 魏家辉2, 李志欣2   

  1. College of Science, Guilin University of Technology, Guilin, Guangxi 541004, China
    2. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China
  • Received: 2021-05-31 Revised: 2021-09-22 Online: 2022-05-25 Published: 2022-06-18
  • Corresponding author: LI Zhi-xin
  • About the authors: ZHUO Ya-qi (female) was born in Xianyang, Shaanxi, in June 1976. She is a lecturer at the College of Science, Guilin University of Technology. Her research interests include image understanding and machine learning. E-mail: zhuoyaqi@126.com
    WEI Jia-hui (male) was born in Dongying, Shandong, in November 1993. He is a Ph.D. candidate at the School of Computer Science and Engineering, Guangxi Normal University. His research interests include image understanding and machine learning. E-mail: weijh@stu.gxnu.edu.cn
    LI Zhi-xin (male) was born in Guilin, Guangxi, in October 1971. He is a professor and doctoral supervisor at the School of Computer Science and Engineering, Guangxi Normal University. His research interests include image understanding, machine learning, and cross-media computing. E-mail: lizx@gxnu.edu.cn
  • Funding:
    National Natural Science Foundation of China (61966004); Natural Science Foundation of Guangxi (2019GXNSFDA245018); Innovation Project of Guangxi Graduate Education (XYCBZ2021002)

Research on Image Captioning Based on a Double Attention Model

ZHUO Ya-qi1, WEI Jia-hui2, LI Zhi-xin2   

  1. College of Science, Guilin University of Technology, Guilin, Guangxi 541004, China
    2. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China
  • Received:2021-05-31 Revised:2021-09-22 Online:2022-05-25 Published:2022-06-18
  • Contact: LI Zhi-xin

Abstract:

The attention models of existing image captioning methods usually adopt word-level attention, which extracts local features from the image as the visual input for generating the current word, and therefore lack accurate guidance from global image information. To address this problem, an image captioning method based on sentence-level attention is proposed: a self-attention mechanism extracts sentence-level attention information from the image to represent the global image information needed to generate the sentence. On this basis, sentence-level attention is combined with word-level attention in a further proposed double attention model to generate more accurate image captions. Supervision and optimization are applied at the intermediate stage of the model to resolve interference between the two kinds of information. In addition, reinforcement learning is applied in two-stage training to optimize the evaluation metrics of the model. Experimental evaluation on the MSCOCO and Flickr30K benchmark datasets shows that the proposed method generates more accurate and richer captions and outperforms many existing attention-based methods on all evaluation metrics.

Key words: image captioning, encoder-decoder architecture, word-level attention, sentence-level attention, double attention model, reinforcement learning

Abstract:

The attention model of existing image captioning approaches usually adopts word-level attention, which extracts local features from the image. The local features are used as the visual input for generating the current word, so the model lacks accurate guidance from global image information. To solve this problem, this paper proposes an image captioning approach based on sentence-level attention. The approach employs a self-attention mechanism to extract sentence-level attention information from the image, which represents the global image information needed to generate the sentence. On this basis, we further propose a double attention model that combines sentence-level attention with word-level attention to generate more accurate descriptions. We apply supervision and optimization at the intermediate stage of the model to solve the problem of interference between the two kinds of information. In addition, reinforcement learning is applied in two-stage training to optimize the evaluation metrics of the model. Finally, we evaluate our approach on two benchmark datasets, MSCOCO and Flickr30K. Experimental results show that the proposed approach generates more accurate and richer captions, and it outperforms many state-of-the-art attention-based image captioning approaches on various evaluation metrics.
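As a rough illustration of the two attention granularities described in the abstract, the contrast can be sketched as follows. This is a minimal NumPy sketch with invented shapes and parameter-free dot-product scoring, not the paper's actual model: sentence-level attention is approximated by self-attention over region features pooled into one global vector, while word-level attention weights regions by the current decoder hidden state.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sentence_level_attention(V):
    # Self-attention over region features V (n x d), pooled into one
    # global vector standing in for the sentence-level image signal.
    n, d = V.shape
    scores = V @ V.T / np.sqrt(d)      # (n, n) pairwise region relevance
    attended = softmax(scores) @ V     # (n, d) context-enriched regions
    return attended.mean(axis=0)       # (d,) global guidance vector

def word_level_attention(V, h):
    # Weight regions by the decoder hidden state h (d,) to get the
    # local visual context for generating the current word.
    weights = softmax(V @ h)           # (n,) attention distribution
    return weights @ V                 # (d,) local context vector

rng = np.random.default_rng(0)
n, d = 5, 8
V = rng.normal(size=(n, d))            # n region features from a CNN encoder
h = rng.normal(size=d)                 # current decoder hidden state

g = sentence_level_attention(V)        # sentence-level (global) signal
l = word_level_attention(V, h)         # word-level (local) signal
fused = np.concatenate([g, l])         # (2d,) double-attention decoder input
```

The concatenation in the last line is only one plausible way to combine the two signals; the paper's actual fusion, intermediate supervision, and reinforcement-learning training are not reproduced here.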

Key words: image captioning, encoder-decoder architecture, word-level attention, sentence-level attention, double attention model, reinforcement learning

CLC Number: