A Visual Grounding Method with Contrastive Learning Large Model

LU Qing-yang; YUAN Guang-lin; ZHU Hong; QIN Xiao-yan; XUE Mo-gen

doi:10.12263/DZXB.20230364

Research article | Updated: 2026-05-07

- A Visual Grounding Method with Contrastive Learning Large Model
- Acta Electronica Sinica, 2024, Vol. 52, No. 10, pp. 3448-3458
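- Affiliations:
  
  1. Graduate Brigade, PLA Army Academy of Artillery and Air Defense, Hefei, Anhui 230031
  2. Department of Information Engineering, PLA Army Academy of Artillery and Air Defense, Hefei, Anhui 230031
  3. Anhui Province Key Laboratory of Polarization Imaging Detection Technology, Hefei, Anhui 230031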
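- About the authors:
  
  LU Qing-yang, male, born in 1994 in Hefei, Anhui. Master's student at the Army Academy of Artillery and Air Defense. His main research interests are multimodal object tracking and visual counting in computer vision. E-mail: lqy465813@163.com
  YUAN Guang-lin, male, born in 1973 in Zhoukou, Henan. Ph.D., professor. His research covers computer vision, machine learning, and their applications. E-mail: yuangl_plus@126.com
  ZHU Hong, female, born in 1987 in Boye, Hebei. She holds a master's degree and is currently pursuing a Ph.D. at the National University of Defense Technology. Her main research interests are visual grounding and visual tracking. E-mail: candy_zhuhong@126.com
  QIN Xiao-yan, female, born in 1980 in Huaibei, Anhui. Associate professor. Her main research interests are object detection, machine learning, and their applications. E-mail: xiaoyanqin_hf@163.com
  XUE Mo-gen, male, born in 1964 in Hefei, Anhui. Ph.D., professor. He is currently a full professor at the PLA Army Academy of Artillery and Air Defense and director of the Anhui Province Key Laboratory of Polarization Imaging Detection Technology. His research covers image processing, photoelectric detection, and object tracking.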
- DOI: 10.12263/DZXB.20230364
- CLC number: TP391.4
- Received: 2023-04-21; revised: 2023-10-18; published in print: 2024-10-25

LU Qing-yang, YUAN Guang-lin, ZHU Hong, et al. A Visual Grounding Method with Contrastive Learning Large Model[J]. Acta Electronica Sinica, 2024, 52(10): 3448-3458. DOI：10.12263/DZXB.20230364

Abstract

One-stage visual grounding methods have attracted wide attention for their speed; they predict the target box from fused image and text features. Existing methods, however, do not align image and text features before fusing them, which limits grounding accuracy. To address this, this paper proposes a visual grounding method built on a large contrastive-learning model. The method extracts image and text features with CLIP (Contrastive Language-Image Pre-training), a large-scale pre-trained model based on contrastive learning, fuses the image and text features with a Transformer encoder, and predicts the target box from the fused features with a multi-layer perceptron. It overcomes the shortcoming above for two reasons: the CLIP encoders yield image and text features that are highly aligned semantically, and global attention lets the contextual features of the image and text interact during fusion. The method was validated experimentally on five datasets, and the results show an improvement in overall accuracy over existing visual grounding methods.
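
The abstract attributes the image-text alignment to CLIP's contrastive pre-training. The following is a minimal sketch, not the authors' code, of a CLIP-style symmetric contrastive loss over a small batch of paired embeddings. The function names, the toy 2-D vectors, and the temperature value of 0.07 are illustrative assumptions; the paper's own training objective for grounding is not stated in this excerpt.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy of one row of logits against a class index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss: pair i of images and texts is the positive,
    every other combination in the batch is a negative.
    (Hypothetical helper; temperature is an illustrative assumption.)"""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two image embeddings with matching vs. swapped text embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_style_loss(images, [[0.1, 0.9], [0.9, 0.1]])
print("aligned pairs loss:", aligned)
print("swapped pairs loss:", shuffled)
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are swapped, which is the sense in which encoders trained this way produce features that are already aligned before any fusion step.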
