[1] Kiros R,Salakhutdinov R,Zemel R S.Unifying visual-semantic embeddings with multimodal neural language models[OL].http://arxiv.org/abs/1411.2539,2014-11-10.
[2] Feng F,Wang X,Li R.Cross-modal retrieval with correspondence autoencoder[A].Proceedings of the 22nd ACM International Conference on Multimedia[C].New York,USA:ACM,2014.7-16.
[3] Wang L,Li Y,Lazebnik S.Learning deep structure-preserving image-text embeddings[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2016.5005-5013.
[4] Wang L,Li Y,Huang J,et al.Learning two-branch neural networks for image-text matching tasks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2019,41(2):394-407.
[5] Vendrov I,Kiros R,Fidler S,et al.Order-embeddings of images and language[OL].http://arxiv.org/abs/1511.06361,2015-11-19.
[6] Nam H,Ha J W,Kim J.Dual attention networks for multimodal reasoning and matching[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2017.299-307.
[7] Faghri F,Fleet D J,Kiros J R,et al.VSE++:Improving visual-semantic embeddings with hard negatives[A].Proceedings of the 28th British Machine Vision Conference[C].Durham,UK:BMVA,2017.121-132.
[8] Huang Y,Wu Q,Song C,et al.Learning semantic concepts and order for image and sentence matching[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2018.6163-6171.
[9] Lee K H,Chen X,Hua G,et al.Stacked cross attention for image-text matching[A].Proceedings of the European Conference on Computer Vision[C].Cham,Swit:Springer,2018.201-216.
[10] He K,Zhang X,Ren S,et al.Deep residual learning for image recognition[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2016.770-778.
[11] Zhang H,Goodfellow I,Metaxas D N,et al.Self-attention generative adversarial networks[A].Proceedings of International Conference on Machine Learning[C].New York,USA:ACM,2019.7354-7363.
[12] Zhang X,Zhao J,LeCun Y.Character-level convolutional networks for text classification[A].Advances in Neural Information Processing Systems[C].Cambridge,UK:MIT Press,2015.649-657.
[13] Hochreiter S,Schmidhuber J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[14] Ren S,He K,Girshick R,et al.Faster R-CNN:Towards real-time object detection with region proposal networks[A].Advances in Neural Information Processing Systems[C].Cambridge,UK:MIT Press,2015.91-99.
[15] Anderson P,He X,Buehler C,et al.Bottom-up and top-down attention for image captioning and visual question answering[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2018.6077-6086.
[16] Huang Z,Xu W,Yu K.Bidirectional LSTM-CRF models for sequence tagging[OL].http://arxiv.org/abs/1508.01991,2015-08-09.
[17] Mnih V,Heess N,Graves A.Recurrent models of visual attention[A].Advances in Neural Information Processing Systems[C].Cambridge,UK:MIT Press,2014.2204-2212.
[18] Hoffer E,Ailon N.Deep metric learning using triplet network[A].Proceedings of International Workshop on Similarity-Based Pattern Recognition[C].Cham,Swit:Springer,2015.84-92.
[19] Karpathy A,Fei-Fei L.Deep visual-semantic alignments for generating image descriptions[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2015.3128-3137.
[20] Yan F,Mikolajczyk K.Deep correlation for matching images and text[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2015.3441-3450.
[21] Huang Y,Wang W,Wang L.Instance-aware image and sentence matching with selective multimodal LSTM[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2017.2310-2318.
[22] Gu J,Cai J,Joty S,et al.Look,imagine and match:Improving textual-visual cross-modal retrieval with generative models[A].Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C].Los Alamitos,USA:IEEE Computer Society,2018.7181-7189.
[23] Niu Z,Zhou M,Wang L,et al.Hierarchical multimodal LSTM for dense visual-semantic embedding[A].Proceedings of the IEEE International Conference on Computer Vision[C].Piscataway,USA:IEEE,2017.1881-1889.
[24] Ma L,Lu Z,Shang L,et al.Multimodal convolutional neural networks for matching image and sentence[A].Proceedings of the IEEE International Conference on Computer Vision[C].Piscataway,USA:IEEE,2015.2623-2631.