Diverse Image Captioning via Conditional Variational Inference and Introspective Adversarial Learning

LIU Bing; LI Sui; LIU Ming-ming; LIU Hao

doi:10.12263/DZXB.20231156


PAPERS | Updated: 2025-12-24

- ACTA ELECTRONICA SINICA Vol. 52, Issue 7, Pages: 2219-2227(2024)
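- Author affiliations:
  
  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116
  2. Ministry of Education Engineering Research Center of Mine Digitization, Xuzhou, Jiangsu 221116
- Funding:
  
  National Natural Science Foundation of China (62276266; 61801198)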
- DOI: 10.12263/DZXB.20231156
  CLC: TP391
- Received: 12 December 2023,
  
  Revised: 19 May 2024,
  
  Published: 25 July 2024
LIU Bing, LI Sui, LIU Ming-ming, et al. Diverse Image Captioning via Conditional Variational Inference and Introspective Adversarial Learning[J]. Acta Electronica Sinica, 2024, 52(07): 2219-2227.

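Abstract (translated from the Chinese)

Existing diverse image captioning methods are constrained by the representational capacity of their latent spaces and by their evaluation metrics, which makes it difficult to balance the diversity and accuracy of the generated captions. To address this, we propose a new diverse image captioning model consisting of a conditional variational inference encoder and a generator. The encoder uses global attention to learn a latent vector space for each word, which strengthens the model's ability to capture caption diversity. The generator produces diverse captions from the given image and the sequence of latent vectors. In addition, we adopt the idea of introspective adversarial learning: the conditional variational inference encoder also acts as a discriminator that distinguishes real captions from generated ones, which gives the model the ability to evaluate its own generated captions and overcomes the limitations of pre-defined evaluation metrics. Experiments on the MSCOCO dataset show that, compared with conventional methods and when 100 captions are randomly generated, the diversity metric mBLEU (mutual overlap-BiLingual Evaluation Understudy) improves by 1.9% and the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves significantly, by 7.5%. Compared with typical multimodal large models, the proposed method is better suited to generating diverse declarative captions with a smaller number of parameters.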
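The encoder-as-discriminator idea can be illustrated with a minimal sketch. It assumes the general IntroVAE-style formulation (Huang et al., 2018), in which the encoder keeps the KL term small for real captions and pushes it above a margin for generated ones, while the generator tries to lower it. The function names, the `margin`, `alpha`, and `beta` parameters, and the omission of image conditioning are illustrative assumptions, not the paper's exact objective.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single latent vector."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def sequence_kl(mus, logvars):
    """Sum the per-word KL terms over a caption's sequence of latent vectors."""
    return sum(kl_to_standard_normal(m, lv) for m, lv in zip(mus, logvars))


def encoder_loss(kl_real, kl_fake, recon, margin=10.0, alpha=1.0, beta=1.0):
    """Encoder acting as discriminator (hypothetical weighting).

    Real captions should have a small KL; generated captions are pushed
    until their KL exceeds `margin`, after which the hinge term vanishes.
    """
    return kl_real + alpha * max(0.0, margin - kl_fake) + beta * recon


def generator_loss(kl_fake, recon, alpha=1.0, beta=1.0):
    """Generator tries to make its captions look 'real', i.e. have a small KL."""
    return alpha * kl_fake + beta * recon


if __name__ == "__main__":
    # Two word positions, each with a 2-dimensional latent vector.
    kl_real = sequence_kl([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
    print("KL of real caption:", kl_real)
    print("encoder loss:", encoder_loss(kl_real, kl_fake=3.0, recon=1.0))
    print("generator loss:", generator_loss(kl_fake=3.0, recon=1.0))
```

Because the same KL quantity serves as the real-versus-generated score, no separate discriminator network is needed in this kind of setup.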

Abstract

Limited by latent space modeling ability and pre-defined diversity metrics, most diverse image captioning models fail to balance diversity and accuracy. To this end, we propose a novel diverse image captioning framework, which consists of a Transformer-based variational inference encoder and a generator. Specifically, the variational inference network learns a latent space for each word to enhance caption diversity modeling, while the generator network produces diverse captions conditioned on each image and a sequence of latent variables. To overcome the limitation of pre-defined metrics, we introduce introspective adversarial learning into the proposed model, where the variational inference network also serves as a discriminator that distinguishes ground-truth captions from those produced by the generator, so no extra discriminator is needed. The proposed method is thus endowed with the ability to self-evaluate the quality of generated captions. Experimental results on the MSCOCO dataset show that, compared with conventional methods, the proposed method with 100 samples improves the mBLEU (mutual overlap-BiLingual Evaluation Understudy) and CIDEr (Consensus-based Image Description Evaluation) scores by 1.9% and 7.5%, respectively. Compared with typical multimodal large models, the proposed method is more suitable for generating diverse declarative captions with fewer parameters.
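To make the diversity metric concrete, the following is a simplified sketch of mBLEU as it is commonly defined: each caption in a sampled set is scored with BLEU against all the other captions in the set, and the scores are averaged. The helper names, whitespace tokenization, and the unsmoothed sentence-level BLEU are assumptions made for illustration; published numbers are normally computed with standard evaluation toolkits, so this sketch will not reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Unsmoothed sentence-level BLEU of one token list against several references."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the hypothesis length.
    closest = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(hypothesis)), length))
    bp = 1.0 if len(hypothesis) >= closest else math.exp(1.0 - closest / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)


def mbleu(captions, max_n=4):
    """Average BLEU of each caption against all the others in the set."""
    if len(captions) < 2:
        raise ValueError("mBLEU needs at least two captions")
    tokenized = [c.lower().split() for c in captions]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    samples = ["a dog runs on the grass", "a dog runs on the beach"]
    print("mBLEU:", round(mbleu(samples), 4))
```

Under this definition, a set of identical captions scores 1.0 and a set with no shared words scores 0.0, so lower values indicate a more diverse set.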
