Diverse Image Captioning Based on Hybrid Global and Sequential Variational Transformer

LIU Bing; LI Sui; LIU Ming-ming; LIU Hao

doi:10.12263/DZXB.20231155

PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 52, Issue 4, Pages: 1305-1314(2024)
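- Author affiliations:
  
  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116
  2. Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, Jiangsu 221116
- Funding:
  
  National Natural Science Foundation of China (62276266; 61801198)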
- DOI: 10.12263/DZXB.20231155
  CLC: TP391
- Received: 12 December 2023,
  
  Revised: 26 February 2024,
  
  Published: 25 April 2024

LIU Bing, LI Sui, LIU Ming-ming, et al. Diverse Image Captioning Based on Hybrid Global and Sequential Variational Transformer[J]. Acta Electronica Sinica, 2024, 52(04): 1305-1314. DOI: 10.12263/DZXB.20231155

Abstract

Diverse image captioning has become a research hotspot in the field of image captioning. However, existing methods ignore the dependency between global and sequential latent vectors, which severely limits further gains in captioning performance. To address this problem, this paper proposes a diverse image captioning framework based on a hybrid variational Transformer. First, we construct a hybrid global and sequential conditional variational autoencoder to model the dependency between global and sequential latent vectors. Second, we derive the evidence lower bound of the hybrid model by maximizing the conditional likelihood, which serves as the objective function for diverse image captioning. Finally, we seamlessly combine the Transformer with the hybrid variational autoencoder so that the two can be jointly optimized to improve the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with the state-of-the-art baseline methods, when randomly generating 20 and 100 captions, the diversity metric m-BLEU (mutual overlap-BiLingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.
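
The m-BLEU diversity metric mentioned in the abstract is commonly computed by scoring each generated caption against the other captions produced for the same image and averaging the results, so a lower value indicates a more diverse caption set. The sketch below illustrates that idea only; it is not the evaluation code used in the paper. It assumes whitespace tokenization and a simplified, unsmoothed sentence-level BLEU, and the function names `bleu` and `mutual_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precision, no smoothing."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # unsmoothed BLEU is zero when any precision is zero
        log_precision += math.log(matched / total) / max_n
    # Brevity penalty against the reference closest in length (ties: shorter).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(log_precision)


def mutual_bleu(captions, max_n=4):
    """Average BLEU of each caption scored against all other captions."""
    tokenized = [c.lower().split() for c in captions]
    scores = []
    for i, cand in enumerate(tokenized):
        others = tokenized[:i] + tokenized[i + 1:]
        scores.append(bleu(cand, others, max_n))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical captions, for illustration only.
    sampled = [
        "a man rides a horse on the beach",
        "a man rides a horse along the shore",
        "a person on horseback near the ocean",
    ]
    print(f"m-BLEU-4: {mutual_bleu(sampled):.3f}")
```

`mutual_bleu` expects at least two captions. A set of identical captions scores 1.0 and a set with no shared words scores 0.0, which is why lower values are read as higher diversity.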
