ZHOU Dong-ming, ZHANG Can-long, LI Zhi-xin, et al. Image Captioning Model Based on Multi-Level Visual Fusion[J]. ACTA ELECTRONICA SINICA, 2021, 49(07): 1286-1290.
In their visual policy networks, traditional image captioning methods attend only to entities and cannot infer the relationships between entities or their attributes, while their language policy networks suffer from exposure bias and error accumulation. To address this, this paper proposes a multi-level visual fusion network model based on reinforcement learning. In the visual policy network, multi-level neural sub-network modules transform visual features into feature sets of visual knowledge. The fusion network generates the function words that make the descriptive sentences more fluent and mediates the interaction between the visual policy network and the language policy network. A self-critical policy gradient algorithm based on reinforcement learning optimizes the visual fusion network end to end. Experimental results on the MS-COCO dataset show that the model improves the CIDEr score on the Karpathy test split from 120.1 to 124.3.
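The self-critical optimization mentioned in the abstract is commonly formulated as a policy-gradient loss in which the reward of a sampled caption is baselined by the reward of the greedily decoded caption. The following is a minimal sketch of that general formulation, not the authors' implementation: the function name, parameter names, and the probability and reward values are hypothetical and chosen only for illustration.

```python
import math

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Loss for one sampled caption under a self-critical baseline.

    sample_logprobs: log-probabilities of the sampled words (hypothetical input)
    sample_reward:   reward (e.g. a CIDEr score) of the sampled caption
    greedy_reward:   reward of the greedily decoded caption, used as baseline
    """
    # Positive advantage: the sampled caption beat the greedy one.
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the log-probability of better-than-baseline samples
    # and lowers it for worse-than-baseline samples.
    return -advantage * sum(sample_logprobs)

# Illustrative values only: a two-word caption with word probabilities 0.5 and 0.25.
logprobs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(logprobs, sample_reward=1.2, greedy_reward=1.0)
```

When the sampled caption scores the same as the greedy caption, the advantage is zero and the sample contributes no gradient, which is why no separately learned baseline is needed in this formulation.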