CLC number:
TP391
Funding
National Natural Science Foundation of China (No. 61663004, No. 61966004, No. 61866004, No. 61762078); Guangxi Natural Science Foundation (No. 2019GXNSFDA245018, No. 2018GXNSFDA281009, No. 2017GXNSFAA198365)