Acta Electronica Sinica (电子学报) ›› 2021, Vol. 49 ›› Issue (2): 268-274. DOI: 10.12263/DZXB.20191037

• Research Paper •

Cross-Media Image-Text Retrieval by Fusing Two-Level Similarity

LI Zhi-xin1, LING Feng1, ZHANG Can-long1, MA Hui-fang2

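  1. Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004;
  2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730070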
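  • Received: 2019-09-10 Revised: 2020-09-20 Published: 2021-02-25
    • Corresponding author: LI Zhi-xin
    • About the authors:
    • LING Feng, male, born in November 1993, a native of Chongzuo, Guangxi. Master's student at the College of Computer Science and Information Engineering, Guangxi Normal University. His research interests are machine learning and cross-media retrieval. E-mail: lingfeng93@126.com
    • ZHANG Can-long, male, born in October 1975, a native of Loudi, Hunan. Professor at the College of Computer Science and Information Engineering, Guangxi Normal University. His research areas are object tracking and pattern recognition.
    • MA Hui-fang, female, born in July 1981, a native of Lanzhou, Gansu. Professor at the College of Computer Science and Engineering, Northwest Normal University. Her research areas are data mining and machine learning.
    • Supported by:
    • National Natural Science Foundation of China (No.61663004, No.61966004, No.61866004, No.61762078); Guangxi Natural Science Foundation (No.2019GXNSFDA245018, No.2018GXNSFDA281009, No.2017GXNSFAA198365)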

Cross-Media Image-Text Retrieval with Two Level Similarity

LI Zhi-xin1, LING Feng1, ZHANG Can-long1, MA Hui-fang2   

  1. Guangxi Key Laboratory of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China;
    2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730070, China
  • Received: 2019-09-10 Revised: 2020-09-20 Online: 2021-02-25 Published: 2021-02-25
    • Corresponding author: LI Zhi-xin
    • Supported by:
    • National Natural Science Foundation of China (No.61663004, No.61966004, No.61866004, No.61762078); Natural Science Foundation of Guangxi Zhuang Autonomous Region, China (No.2019GXNSFDA245018, No.2018GXNSFDA281009, No.2017GXNSFAA198365)

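Abstract: To better reveal the latent semantic associations between images and text, a cross-media retrieval method that fuses two levels of similarity is proposed. Two subnetworks are built to process global features and local features separately, in order to obtain better semantic matching between images and text. An image is given two representations, the whole image and a set of image regions; a text likewise has two representations, the whole sentence and a set of words. A two-level alignment method is designed to match the global and the local representations of images and text separately, and the two similarities are fused to learn a complete cross-media representation. Experimental results on the MSCOCO and Flickr30K datasets show that the method makes semantic matching between images and text more accurate and outperforms many current state-of-the-art cross-media retrieval methods.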

Key words: convolutional neural network, self-attention network, two-level similarity, cross-media retrieval

Abstract: To better reveal the latent semantic correlation between image and text, this paper proposes a cross-media retrieval method that fuses two-level similarity. The method constructs two subnets that handle global features and local features respectively, so as to obtain better semantic matching between image and text. The image representation is divided into the whole image and a set of image regions, and the text representation is likewise divided into the whole sentence and a set of words. A two-level alignment method is designed to match the global and local representations of image and text, and the two similarities are fused to learn the complete cross-media representation. Experimental results on the MSCOCO and Flickr30K datasets show that the proposed method makes the semantic matching of image and text more accurate and is superior to many state-of-the-art cross-media retrieval methods.
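
The abstract does not give the scoring functions, so the following is only a minimal sketch of what fusing a global and a local similarity could look like, not the authors' implementation. It assumes cosine similarity for the whole-image versus whole-sentence score, a local score that matches each word to its most similar image region and averages the results, and a weighted sum as the fusion rule. All function names and the `weight` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_similarity(image_vec, sentence_vec):
    """Cosine similarity between a whole-image and a whole-sentence vector."""
    return float(l2_normalize(image_vec) @ l2_normalize(sentence_vec))


def local_similarity(region_vecs, word_vecs):
    """Assumed local score: best-matching region per word, averaged over words."""
    # sims has shape (num_words, num_regions)
    sims = l2_normalize(word_vecs) @ l2_normalize(region_vecs).T
    return float(sims.max(axis=1).mean())


def fused_similarity(image_vec, sentence_vec, region_vecs, word_vecs, weight=0.5):
    """Weighted sum of both levels; `weight` is a hypothetical mixing factor."""
    g = global_similarity(image_vec, sentence_vec)
    l = local_similarity(region_vecs, word_vecs)
    return weight * g + (1.0 - weight) * l


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_vec = rng.normal(size=16)          # stand-in for a global image feature
    sentence_vec = rng.normal(size=16)       # stand-in for a global sentence feature
    region_vecs = rng.normal(size=(5, 16))   # 5 image regions
    word_vecs = rng.normal(size=(7, 16))     # 7 words
    print(fused_similarity(image_vec, sentence_vec, region_vecs, word_vecs))
```

In a retrieval setting, a score of this kind would be computed for every image-text pair and the candidates ranked by it. The features themselves would come from the two subnets the abstract mentions, which are replaced here by random vectors.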

Key words: convolutional neural network, self-attention network, two-level similarity, cross-media retrieval

CLC number: