Overview of Image Captions Based on Deep Learning

石义乐; 杨文忠; 杜慧祥; 王丽花; 王婷; 理珊珊

doi:10.12263/DZXB.20200669

Review and Commentary | Updated: 2025-12-08

- Original title (Chinese): 基于深度学习的图像描述综述
- English title: Overview of Image Captions Based on Deep Learning
- Acta Electronica Sinica (电子学报), 2021, Vol. 49, No. 10, pp. 2048-2060
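- Author affiliations:
  
  1. Key Laboratory of Software Engineering Technology, Xinjiang University, Urumqi, Xinjiang 830000
  2. School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830000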
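- Author biographies:
  
  石义乐 (SHI Yi-le): male, born in 1994 in Luoyang, Henan. He is currently a graduate student at the School of Software, Xinjiang University. His main research interest is image understanding. E-mail: 2229842870@qq.com
  杨文忠 (YANG Wen-zhong, corresponding author): male, born in 1971 in Urumqi, Xinjiang. He received his Ph.D. from Wuhan University in 2011. He is currently a graduate supervisor and associate professor at the School of Information Science and Engineering, Xinjiang University. His main research interests are cyberspace security, machine learning, and algorithm design and analysis. E-mail: ywz_xy@163.com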
- Funding:
  
  National Natural Science Foundation of China (U1603115); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2017D01C042)
- DOI: 10.12263/DZXB.20200669
  CLC number: TP391.2
- Received: 2020-07-08; revised: 2020-09-13; published in print: 2021-10-25

SHI Yi-le, YANG Wen-zhong, DU Hui-xiang, et al. Overview of Image Captions Based on Deep Learning[J]. ACTA ELECTRONICA SINICA, 2021, 49(10): 2048-2060.

Abstract

Image captioning extracts features from an image, feeds them into a language generation model, and outputs a description of the image. It addresses intelligent image understanding, a problem at the intersection of natural language processing and computer vision in artificial intelligence. This paper summarizes and analyzes representative image captioning papers published from 2015 to 2020 and, taking the core technique as the classification criterion, roughly divides the methods into five categories: image captioning based on the Encoder-Decoder framework, on attention mechanisms, on reinforcement learning, on generative adversarial networks, and on newly fused datasets. Three models, NIC, Hard-Attention, and Neural Talk, are evaluated on the real-world MS-COCO dataset, and their average BLEU1, BLEU2, BLEU3, and BLEU4 scores are compared to show how the three models perform. The paper also points out future development trends in image captioning, the challenges the field will face, and research directions that merit deeper investigation.
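
The abstract compares the three models by their BLEU1 to BLEU4 scores. As a rough illustration of what those scores measure, the sketch below computes a sentence-level cumulative BLEU-n from clipped n-gram precision and a brevity penalty. It is not the evaluation code used in the paper: the function names and the toy sentences are assumptions made for this example, no smoothing is applied, and published results are normally computed at corpus level with a standard toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Credit each candidate n-gram at most as often as it appears in the
    # single reference where it is most frequent.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n):
    """Cumulative BLEU-max_n with uniform weights and a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Use the reference length closest to the candidate length;
    # ties are broken toward the shorter reference.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical toy captions, not taken from MS-COCO.
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "the dog runs across the grass".split(),
]
for n in range(1, 5):
    print(f"BLEU{n} = {bleu(candidate, references, n):.4f}")
```

In this toy case every candidate word appears in some reference, so BLEU1 is 1.0, while no 4-gram matches, so the unsmoothed BLEU4 drops to 0. This is one reason the higher-order scores are reported alongside BLEU1: they are far more sensitive to word order and phrasing.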

