

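1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
2. Engineering Research Center of Mine Digitization, Ministry of Education, Xuzhou, Jiangsu 221116, China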
Received: 12 December 2023
Revised: 26 February 2024
Published: 25 April 2024
LIU Bing, LI Sui, LIU Ming-ming, et al. Diverse Image Captioning Based on Hybrid Global and Sequential Variational Transformer[J]. Acta Electronica Sinica, 2024, 52(04): 1305-1314. DOI:10.12263/DZXB.20231155
Diverse image captioning has become a research hotspot in image captioning. However, existing methods ignore the dependency between global and sequential latent vectors, which severely limits captioning performance. To address this problem, this paper proposes a diverse image captioning framework based on a hybrid variational Transformer. First, we construct a hybrid global and sequential conditional variational autoencoder to represent the dependency between global and sequential latent vectors. Second, we derive the evidence lower bound of the hybrid model by maximizing the conditional likelihood, which serves as the objective function for diverse image captioning. Finally, we seamlessly combine the Transformer with the hybrid variational autoencoder and optimize them jointly to improve the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with the state-of-the-art baseline methods, when 20 and 100 captions are randomly generated, the diversity metric m-BLEU (mutual overlap-BiLingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.
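To make the second step of the abstract more concrete, the following is a generic evidence lower bound for a conditional variational autoencoder with one global latent vector and a sequence of per-token latent vectors. It is only a sketch under assumed notation and an assumed factorization, not the bound derived in the paper: the symbols $I$ (image), $x_{1:T}$ (caption tokens), $z_g$ (global latent), $z_t$ (sequential latent at step $t$), the prior $p_\theta$, and the approximate posterior $q_\phi$ are all introduced here for illustration.

```latex
% Assumed generative model (illustrative, not taken from the paper):
%   p(x, z_g, z_{1:T} | I) = p(z_g | I) \prod_t p(z_t | z_{<t}, z_g, x_{<t}, I)
%                                        p(x_t | x_{<t}, z_{\le t}, z_g, I)
% Assumed approximate posterior:
%   q(z_g, z_{1:T} | x, I) = q(z_g | x, I) \prod_t q(z_t | z_{<t}, z_g, x, I)
\begin{aligned}
\log p_\theta(x_{1:T} \mid I) \;\ge\;
  & \; \mathbb{E}_{q_\phi(z_g, z_{1:T} \mid x, I)}
      \Big[ \sum_{t=1}^{T} \log p_\theta\big(x_t \mid x_{<t}, z_{\le t}, z_g, I\big) \Big] \\
  & - \mathrm{KL}\big( q_\phi(z_g \mid x, I) \,\|\, p_\theta(z_g \mid I) \big) \\
  & - \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_g, z_{<t} \mid x, I)}
      \Big[ \mathrm{KL}\big( q_\phi(z_t \mid z_{<t}, z_g, x, I) \,\|\,
                             p_\theta(z_t \mid z_{<t}, z_g, x_{<t}, I) \big) \Big]
\end{aligned}
```

In this assumed form, the dependency between the global and sequential latent vectors shows up as the conditioning of every $z_t$ on $z_g$ in both the prior and the posterior. The first term is the reconstruction likelihood, and the two KL terms regularize the global and sequential latents separately.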