Multimodal Pretraining with Cross-Modal Guidance and Alignment

CAI Hua; YI Ya-xi; FU Qiang; RAN Yue; SUN Jun-xi

doi:10.12263/DZXB.20240271

PAPERS | Updated: 2026-05-07

- ACTA ELECTRONICA SINICA Vol. 52, Issue 10, Pages: 3368-3381(2024)
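- Author affiliations:
  
  1. School of Electronic and Information Engineering, Changchun University of Science and Technology, Changchun, Jilin 130022
  2. Changchun China Optical Science and Technology Museum, Changchun, Jilin 130117
  3. Institute of Space Optoelectronic Technology, Changchun University of Science and Technology, Changchun, Jilin 130022
  4. School of Information Science and Technology, Northeast Normal University, Changchun, Jilin 130117
- Funding:
  
  National Natural Science Foundation of China (61890963; U2341226); Jilin Province Talent Development Special Fund (20240602015RC); Xi'an Key Laboratory of Aircraft Optical Imaging and Measurement Technology (2023-13)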
- DOI: 10.12263/DZXB.20240271
  CLC: TP391
- Received: 28 March 2024, Revised: 28 July 2024, Published: 25 October 2024

CAI Hua, YI Ya-xi, FU Qiang, et al. Multimodal Pretraining with Cross-Modal Guidance and Alignment[J]. Acta Electronica Sinica, 2024, 52(10): 3368-3381.

Abstract

Existing vision-language multimodal pre-training methods align image and text features only at the level of global semantics and do not adequately explore fine-grained feature interactions between modalities. To address this problem, this paper proposes a multimodal pre-training method based on cross-modal guidance and alignment. In the modality feature extraction stage, the method uses a dual-stream feature extraction network based on visual sequence compression: within the visual encoder, image and text information jointly guide the compression of the visual sequence layer by layer, which mitigates the interference of text-irrelevant, redundant visual information with fine-grained cross-modal interaction. In the modality feature alignment stage, fine-grained relational reasoning is performed on the image and text features to achieve local feature alignment between visual tokens and text tokens, strengthening the model's understanding of fine-grained alignment relationships between modalities. Experimental results show that the method aligns fine-grained visual and textual features more effectively. In image-text retrieval, the fine-tuned model reaches an average recall of 86.4% for image retrieval and 94.88% for text retrieval, and its overall zero-shot image-text retrieval metric improves by 5.36% over the classic image-text retrieval algorithm CLIP (Contrastive Language-Image Pre-training). In classification tasks such as visual question answering, its accuracy also exceeds that of current mainstream multimodal pre-training methods.
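The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: this page does not specify the compression or alignment modules, so the scoring rule (cosine similarity between each patch token and a pooled text feature), the fixed `keep_ratio`, and the token-wise max-similarity score (in the style of late-interaction methods such as FILIP) are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def compress_visual_tokens(patches, text_feature, keep_ratio=0.5):
    """Keep the patch tokens most similar to a pooled text feature.

    patches:      (num_patches, dim) visual tokens from one encoder layer
    text_feature: (dim,) pooled text representation used as guidance
    Assumed scoring rule: cosine similarity to the text feature.
    """
    scores = l2_normalize(patches) @ l2_normalize(text_feature)
    num_keep = max(1, int(round(keep_ratio * len(patches))))
    # Take the highest-scoring patches, then restore their original order.
    keep = np.sort(np.argsort(-scores)[:num_keep])
    return patches[keep], keep


def fine_grained_similarity(visual_tokens, text_tokens):
    """Token-level alignment score between one image and one caption.

    Each token is matched to its most similar token in the other
    modality; the two directions are averaged into a single score.
    """
    sim = l2_normalize(visual_tokens) @ l2_normalize(text_tokens).T
    image_to_text = sim.max(axis=1).mean()
    text_to_image = sim.max(axis=0).mean()
    return 0.5 * (image_to_text + text_to_image)


# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(5, 16))
patches = rng.normal(size=(12, 16))
kept, kept_idx = compress_visual_tokens(
    patches, text_tokens.mean(axis=0), keep_ratio=0.25
)
score = fine_grained_similarity(kept, text_tokens)
print(kept.shape, kept_idx, float(score))
```

In a layer-by-layer setting, `compress_visual_tokens` would be applied after selected encoder layers so the visual sequence shrinks progressively; the token-level score could then feed a contrastive loss over image-text pairs. Both choices are illustrative only.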
