Image Captioning Model Based on Multi‑Level Visual Fusion

ZHOU Dong-ming; ZHANG Can-long; LI Zhi-xin; WANG Zhi-wen

doi:10.12263/DZXB.20191296


Research article | Updated: 2025-12-08

- Image Captioning Model Based on Multi‑Level Visual Fusion
- Acta Electronica Sinica, 2021, Vol. 49, No. 7, pp. 1286-1290
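- Author affiliations:
  
  1. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004
  2. School of Computer Science and Communication Engineering, Guangxi University of Science and Technology, Liuzhou, Guangxi 545006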
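- About the authors:
  
  ZHOU Dong-ming, male, was born in May 1995 in Xinyang, Henan Province. He is a master's student at Guangxi Normal University. His main research interests are machine learning and image processing. E-mail: dmzhou1995@163.com
  ZHANG Can-long (corresponding author), male, was born in 1975 in Loudi, Hunan Province. He is a professor and doctoral supervisor at Guangxi Normal University. He received his Ph.D. in control theory and control engineering from Shanghai Jiao Tong University. His main research interests are computer vision and machine learning. E-mail: zcltyp@163.com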
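- Funding:
  
  National Natural Science Foundation of China (61866004; 61663004; 61966004; 61962007; 61751213); Guangxi Natural Science Foundation (2018GXNSFDA281009; 2017GXNSFAA198365; 2019GXNSFDA245018; 2018GXNSFDA29400); Guangxi "Bagui Scholar" Innovation Research Team; Foundation of the Guangxi Key Lab of Multi-source Information Mining and Security (20‑A‑03‑01); Innovation Project of Guangxi Graduate Education (XYCSZ2020071)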
- DOI: 10.12263/DZXB.20191296
  Chinese Library Classification (CLC) number: TP181
- Received: 2019-11-21; revised: 2021-02-04; published in print: 2021-07-25

ZHOU Dong-ming, ZHANG Can-long, LI Zhi-xin, et al. Image Captioning Model Based on Multi‑Level Visual Fusion[J]. Acta Electronica Sinica, 2021, 49(07): 1286-1290. DOI: 10.12263/DZXB.20191296.

Abstract

Traditional methods attend only to entities in the visual policy network and cannot infer the relationships between entities and attributes, while the language policy network suffers from exposure bias and error accumulation. To address these problems, this paper proposes a multi‑level visual fusion network model based on reinforcement learning. In the visual policy network, multi‑level neural network modules transform visual features into feature sets of visual knowledge. A fusion network generates the function words that make the caption more fluent and mediates the interaction between the visual policy network and the language policy network. In the language policy network, a reinforcement‑learning‑based self‑critical policy gradient algorithm optimizes the visual fusion network end to end. Experiments show that the model performs well on the MS‑COCO dataset, raising the CIDEr score on the Karpathy test split from 120.1 to 124.3.
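The self-critical policy gradient step named in the abstract can be illustrated with a small sketch. It assumes the commonly described form of self-critical training, in which the reward of the greedily decoded caption serves as the baseline for a caption sampled from the policy. The function names, the toy word-overlap reward (a stand-in for a caption metric such as CIDEr), and the log-probability values are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def toy_reward(caption: Sequence[str], reference: Sequence[str]) -> float:
    """Hypothetical stand-in for a caption metric such as CIDEr:
    the fraction of reference words that appear in the caption."""
    if not reference:
        return 0.0
    caption_words = set(caption)
    return sum(1 for w in reference if w in caption_words) / len(reference)


def self_critical_loss(sample_log_probs: Sequence[float],
                       sample_reward: float,
                       greedy_reward: float) -> float:
    """Surrogate loss whose gradient is a self-critical policy gradient.

    The greedy caption's reward is the baseline, so a sampled caption that
    beats greedy decoding has its log-probability pushed up when the loss
    is minimized, and one that does worse is pushed down.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_log_probs)


# Made-up example data (assumptions for illustration only).
reference = ["a", "dog", "runs", "on", "grass"]
greedy = ["a", "dog", "on", "a", "field"]       # argmax (test-time) caption
sampled = ["a", "dog", "runs", "on", "grass"]   # caption sampled from the policy
sampled_log_probs = [-0.5, -1.0, -2.0, -0.5, -1.0]  # per-word log-probabilities

r_greedy = toy_reward(greedy, reference)
r_sample = toy_reward(sampled, reference)
loss = self_critical_loss(sampled_log_probs, r_sample, r_greedy)
print(f"greedy reward={r_greedy:.2f}, sample reward={r_sample:.2f}, loss={loss:.2f}")
```

In a real training loop the log-probabilities would come from the captioning network and the loss would be differentiated by an autodiff framework. Because the baseline is the model's own test-time output, no separate value network is needed to reduce the variance of the gradient estimate.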


