Cross-Media Image-Text Retrieval with Two Level Similarity

LI Zhi-xin1, LING Feng1, ZHANG Can-long1, MA Hui-fang2

Acta Electronica Sinica, 2021, Vol. 49, Issue 2: 268-274. DOI: 10.12263/DZXB.20191037

Research Article


Abstract

To better reveal the latent semantic correlation between image and text, this paper proposes a cross-media retrieval method that fuses two levels of similarity, constructing two subnets to process global features and local features respectively so as to obtain better semantic matching between image and text. An image is represented both as a whole image and as a set of image regions; likewise, a text is represented both as a whole sentence and as a set of words. A two-level alignment method is designed to match the global and local representations of image and text, and the two similarities are fused to learn a complete cross-media representation. Experimental results on the MSCOCO and Flickr30K datasets show that the proposed method yields more accurate semantic matching between image and text and outperforms many state-of-the-art cross-media retrieval methods.
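The paper itself provides no code. The fusion of global and local similarity described above might be sketched as follows; here cosine similarity at both levels, max-over-regions pooling for the local term, and the weighting parameter `alpha` are all illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def global_similarity(img_vec, txt_vec):
    # Global level: whole-image embedding vs. whole-sentence embedding.
    return cosine(img_vec, txt_vec)

def local_similarity(regions, words):
    # Local level: for each word, take its best-matching image region
    # (max pooling over regions), then average over all words.
    scores = [max(cosine(r, w) for r in regions) for w in words]
    return float(np.mean(scores))

def fused_similarity(img_vec, txt_vec, regions, words, alpha=0.5):
    # Fuse the two levels with an assumed convex combination;
    # alpha is a hypothetical hyperparameter, not from the paper.
    return (alpha * global_similarity(img_vec, txt_vec)
            + (1 - alpha) * local_similarity(regions, words))
```

At retrieval time, a query from one modality would be ranked against candidates from the other by this fused score; the subnets that produce `img_vec`, `regions`, `txt_vec`, and `words` (e.g. a CNN over the image and region detector, a sentence/word encoder over the text) are outside the scope of this sketch.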


Key words

convolutional neural network / self-attention network / two level similarity / cross-media retrieval

Cite This Article
LI Zhi-xin, LING Feng, ZHANG Can-long, MA Hui-fang. Cross-Media Image-Text Retrieval with Two Level Similarity[J]. Acta Electronica Sinica, 2021, 49(2): 268-274. https://doi.org/10.12263/DZXB.20191037
CLC Number: TP391

Funding

National Natural Science Foundation of China (No.61663004, No.61966004, No.61866004, No.61762078); Natural Science Foundation of Guangxi Province (No.2019GXNSFDA245018, No.2018GXNSFDA281009, No.2017GXNSFAA198365)