

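1. School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
2. Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, Jiangsu 221116, China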
Received: 24 April 2025
Accepted: 29 August 2025
Published: 25 September 2025
DU Wen-liang, XU Xiao-yu, ZHAO Jia-qi, et al. A Remote Sensing Image Text Retrieval Method Based on the Shared Prompt and Mamba Adapter[J]. Acta Electronica Sinica, 2025, 53(09): 3358-3370. DOI:10.12263/DZXB.20250326
Remote sensing image-text retrieval aims to quickly and accurately retrieve, from a massive remote sensing image-text database, the text or images that semantically match a given image or text. With the rapid development of Earth observation technology, the application value of this technique in fields such as urban planning, disaster emergency response, and environmental monitoring has become increasingly prominent, making it a research hotspot in multimodal information processing. Vision-language pre-training models, pre-trained on general-domain data, have laid the technical foundation for general image-text retrieval by achieving efficient semantic alignment between images and text. However, a significant domain gap exists between general and remote sensing data, which limits the performance of these models when they are applied directly to remote sensing tasks. Fine-tuning is therefore needed to adapt the vision-language model to the distinctive data distribution of the remote sensing domain. Existing fine-tuning methods, however, face two core challenges in this domain. First, cross-modal alignment is insufficient: current fine-tuning methods lack an explicit mechanism for cross-modal information interaction, making it difficult to fully model the intrinsic correlation between images and text. Second, fine-grained semantic representation is difficult: existing methods often struggle to capture fine-grained semantic information in remote sensing images, such as large differences in target scale, high similarity between ground-object classes, and complex spatial-topological relationships. Performance is particularly limited for small targets and for semantic confusion caused by similar ground objects, which significantly reduces retrieval accuracy. To address these two problems, this paper proposes a fine-tuning method based on a shared prompt and a Mamba adapter. The method first designs a cross-modal shared prompt generation module to establish an explicit interaction mechanism between image and text features. It then constructs a dual-branch Mamba adapter fine-tuning module for remote sensing scenarios, with separate image and text branches, to obtain fine-grained representations of image and text features. Finally, it uses a contrastive loss and an affiliation loss to alleviate the semantic confusion caused by small targets or similar ground objects in remote sensing images. Experimental results show that the method achieves mean recall of 37.3% and 48.05% on the Remote Sensing Image Captioning Dataset (RSICD) and the Remote Sensing Image-Text Match Dataset (RSITMD), respectively, improvements of 3.68% and 1.52% over the current best adapter fine-tuning method. Ablation studies further verify the effectiveness of the shared prompt generation module and the Mamba adapter.
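For readers unfamiliar with the metric, mean recall in image-text retrieval is commonly computed as the average of Recall@K over both retrieval directions (image-to-text and text-to-image). The sketch below illustrates that common definition only; the abstract does not state the cutoffs or the exact protocol, so the default cutoffs K = 1, 5, 10, the function names `recall_at_k` and `mean_recall`, and the toy similarity matrix are all assumptions made for illustration.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries whose top-k candidates contain a correct match.

    sim[q][c] is the similarity between query q and candidate c.
    gt[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if any(c in gt[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


def mean_recall(sim_i2t, gt_i2t, ks=(1, 5, 10)):
    """Average Recall@K over both directions (assumed definition)."""
    # Text-to-image similarities are the transpose of image-to-text.
    sim_t2i = [list(col) for col in zip(*sim_i2t)]
    # Invert the ground truth: which images are correct for each text.
    gt_t2i = [
        {i for i, texts in enumerate(gt_i2t) if t in texts}
        for t in range(len(sim_t2i))
    ]
    scores = [recall_at_k(sim_i2t, gt_i2t, k) for k in ks]
    scores += [recall_at_k(sim_t2i, gt_t2i, k) for k in ks]
    return sum(scores) / len(scores)


# Toy data (hypothetical): 3 images, 3 texts, text i describes image i.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
]
gt = [{0}, {1}, {2}]
print(round(mean_recall(sim, gt, ks=(1, 2)), 2))  # 58.33
```

In the toy case the four per-direction scores are 33.33 and 66.67 (image-to-text at K = 1, 2) and 33.33 and 100 (text-to-image at K = 1, 2), which average to 58.33. A single averaged figure like this rewards methods that perform well in both directions and at every cutoff, rather than at one cutoff alone.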