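1. College of Computer Science, Zhejiang University of Technology, Hangzhou, Zhejiang 310023
2. Hangzhou Branch, China Electronic Port Data Center, Hangzhou, Zhejiang 310008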
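QIN Yu-shu, female, was born in April 1999 in Changzhi, Shanxi Province. She is currently a master's student at Zhejiang University of Technology. Her main research interest is multimodal retrieval. E-mail: qinyushu1999@163.com
YANG Liang-huai, male, was born in 1967 in Xinchang County, Zhejiang Province. He is currently a professor at the College of Computer Science, Zhejiang University of Technology. His main research interest is data science and engineering. E-mail: yanglh@zjut.edu.cn
ZHU Yan-chao, female, was born in November 1981 in Lishui, Zhejiang Province. She graduated from Dalian Maritime University. Her main research interest is information systems. E-mail: dongdong7981@126.com
GONG Wei-hua, male, was born in 1977 in Hanyang District, Wuhan, Hubei Province. He is currently an associate professor at the College of Computer Science, Zhejiang University of Technology. His main research interests are machine learning and social networks. E-mail: whgong@sohu.com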
Received: 2024-07-22
Revised: 2024-11-01
Published in print: 2025-02-25
QIN Yu-shu, YANG Liang-huai, ZHU Yan-chao, et al. A Combined Retrieval Method by Fusing Image and Text Features[J]. Acta Electronica Sinica, 2025, 53(02): 558-567. DOI:10.12263/DZXB.20240679
With the explosive growth of image data in e-commerce, retrieving a target image has become a challenging task in information retrieval research. Existing conventional image retrieval models rely on only a single text description or a similar image, which makes it difficult to capture the user's retrieval intent accurately and leads to unsatisfactory results. To address this problem, this paper proposes a combined retrieval method that fuses image and text features. Swin Transformer (SwinT) is used to extract multi-layer features from the reference image, and the image and text features are fused at multiple levels, so that the text features modify the reference image features in a multi-level, fine-grained manner and bring them closer to the target image features. The modified image features and the target image features are then embedded in a single space for similarity measurement, and a batch-based classification loss is used to optimize retrieval performance. Experimental results on three datasets, Fashion200k, MIT-States, and CSS, show that the proposed method improves performance by 5 percentage points on average over existing mainstream methods.
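The pipeline summarized in the abstract (fuse text into multi-level image features, compare in one embedding space, train with a batch-based classification loss) can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the function names are hypothetical, the additive `fuse_levels` is a toy stand-in for the learned fusion modules, cosine similarity is assumed as the similarity measure, the `scale` factor is arbitrary, and the loss is read as softmax cross-entropy in which query i must pick target i among the targets in its batch.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_levels(image_levels, text_feat):
    """Toy multi-level fusion (hypothetical stand-in for learned modules):
    add the text feature to each level's image feature, then average levels."""
    dim = len(text_feat)
    fused = [0.0] * dim
    for level in image_levels:
        for k in range(dim):
            fused[k] += (level[k] + text_feat[k]) / len(image_levels)
    return fused


def similarity_matrix(queries, targets):
    """Cosine similarity between every query and every target embedding."""
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    return [[sum(a * b for a, b in zip(qi, tj)) for tj in t] for qi in q]


def batch_classification_loss(sim, scale=10.0):
    """Softmax cross-entropy over a square batch similarity matrix.
    Row i is a query; its correct "class" is target i (the diagonal)."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]
    return total / len(sim)


def rank_targets(sim_row):
    """Indices of candidate targets, most similar first."""
    return sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)


# Usage: one reference image with two feature levels, modified by one text feature.
image_levels = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
text_feat = [0.0, 1.0, 0.0]
query = fuse_levels(image_levels, text_feat)
candidates = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_targets(similarity_matrix([query], candidates)[0])
print(ranking)
```

In this toy setting the loss is small when each modified query is most similar to its own target and grows when the diagonal entries of the similarity matrix are not the row maxima, which is the behavior a batch-based classification objective is meant to encourage.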