Text Prompted Image Coding for Machine

HUANG Zhimeng; GAO Feng; YANG Fan; MA Siwei

doi:10.12263/DZXB.20250778

Research article | Updated: 2026-06-04

- Text Prompted Image Coding for Machine
- Acta Electronica Sinica, 2026, Vol. 54, No. 1, pp. 19-31
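- Author affiliations:
  
  1. School of Computer Science, Peking University, Beijing 100871, China
  2. School of Arts, Peking University, Beijing 100871, China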
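- About the authors:
  
  HUANG Zhimeng, male, was born in June 1997 in Jining, Shandong Province. He is currently a Boya Postdoctoral Fellow at the School of Computer Science, Peking University. His main research interests include intelligent coding, image and video coding for machine vision, multimedia technology, and signal processing. E-mail: zmhuang@pku.edu.cn
  GAO Feng, male, was born in November 1983 in Beijing. He is currently a researcher and doctoral supervisor at the School of Arts, Peking University, and director of its Creative Lab. His main research interests lie at the intersection of computing and art, exploring applications of artificial intelligence in education, art, health, and other areas of future human life. E-mail: gaof@pku.edu.cn
  YANG Fan, male, was born in October 1992 in Ruijin, Jiangxi Province. He is currently a senior engineer at the School of Arts, Peking University. His main research interests include multimedia and artificial intelligence, and computational art. E-mail: fyang.eecs@pku.edu.cn
  MA Siwei, male, was born in February 1979 in Liaocheng, Shandong Province. He is currently a Boya Distinguished Professor at Peking University, deputy Party secretary and doctoral supervisor at the School of Computer Science, Peking University, and deputy director of the National Engineering Research Center of Visual Technology. His main research interests include video coding and processing. Chinese Institute of Electronics member No. E190014267M. E-mail: swma@pku.edu.cn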
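- Funding:
  
  National Natural Science Foundation of China (62025101; 62176006); China Postdoctoral Science Foundation (2025M771511)
- DOI: 10.12263/DZXB.20250778
  Chinese Library Classification (CLC) number: TP37; TP39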
- Received: 2025-09-07; accepted: 2026-01-19; print publication: 2026-01-25
HUANG Zhimeng, GAO Feng, YANG Fan, et al. Text Prompted Image Coding for Machine[J]. Acta Electronica Sinica, 2026, 54(01): 19-31. DOI: 10.12263/DZXB.20250778

Abstract

In recent years, with the rapid development of typical machine-to-machine (M2M) communication scenarios such as the Internet of Things (IoT), semantic communication, and smart cities, the real-time transmission and efficient processing of massive visual data between devices have become a critical challenge. In this context, traditional image coding methods, which are optimized primarily for human perceptual quality, often deliver insufficient analysis accuracy on machine vision tasks because their optimization objectives differ fundamentally from the requirements of machine analysis. Image coding for machine (ICM) has emerged in response. It aims to preserve the analysis accuracy of downstream machine vision tasks (e.g., classification, detection, and segmentation) at the lowest possible bitrate, and thereby to better fit the bandwidth and storage constraints of M2M scenarios. However, existing ICM methods still face two major bottlenecks. First, their performance degrades sharply at extremely low bitrates: most current approaches rely on end-to-end nonlinear transforms to extract visual features and fail to fully exploit compact representations of the high-level semantic information in images, which makes feature coding inefficient. Second, they generalize poorly in open-set scenarios: most methods are optimized for a single task or a single dataset, lack adaptability to unseen categories and cross-domain data, and therefore struggle to maintain stable analysis performance in practical, dynamic environments. To overcome these limitations, this paper proposes a text-prompted image coding for machine (T-ICM) framework. The core idea is to decouple image information into two complementary components: semantic information and texture information. The semantic information is represented and encoded as structured text prompts (e.g., object categories and location descriptions), while the texture information is extracted and compressed as task-agnostic general visual features. At the encoder side, the text prompts are highly abstract and semantically compact, so they can significantly reduce the overall bitrate, and the general features are compressed efficiently by the proposed grouped feature coding module. At the decoder side, the text prompts are not only parsed directly to accomplish tasks such as classification and detection but, more importantly, also act as guidance signals: through a prompt encoder and a mask decoder, they dynamically adjust the semantically relevant regions of the reconstructed general features. This enables feature-level domain adaptation and task adaptation, and thereby significantly improves the robustness of the model in open-set scenarios. T-ICM is comprehensively evaluated on multiple standard datasets and tasks. Experiments show that on dense prediction tasks such as semantic segmentation and instance segmentation, T-ICM maintains analysis accuracy close to that obtained with the original images even at extremely low bitrates, significantly outperforming H.266/VVC, learned image codecs, and other existing ICM methods. By moving semantic information into the highly compressed text modality for transmission and using it to guide feature reconstruction, T-ICM achieves a better trade-off between coding efficiency and task performance. This work offers a new perspective and technical support for the future development of semantic communication, collaborative edge intelligence, and adaptive machine vision systems.
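As a rough illustration of why a structured text prompt costs so few bits, the sketch below estimates the bits per pixel (bpp) spent on a prompt for an image of a given size. Everything in it is an assumption made for illustration, not the paper's coding scheme: the JSON prompt layout, the helper name `prompt_bpp`, and the use of `zlib` as a stand-in entropy coder are all hypothetical.

```python
import json
import zlib


def prompt_bpp(prompt: dict, width: int, height: int) -> float:
    """Bits per pixel spent on a text prompt for a width x height image."""
    # Serialize compactly; this JSON layout is a hypothetical prompt format.
    payload = json.dumps(prompt, separators=(",", ":")).encode("utf-8")
    # zlib is only a stand-in for whatever entropy coder a real system uses.
    compressed = zlib.compress(payload, level=9)
    # Keep the smaller size, since zlib can expand very short inputs.
    n_bytes = min(len(payload), len(compressed))
    return 8 * n_bytes / (width * height)


if __name__ == "__main__":
    # Hypothetical prompt: object categories with bounding boxes.
    prompt = {
        "objects": [
            {"category": "dog", "box": [48, 60, 310, 420]},
            {"category": "person", "box": [200, 10, 500, 470]},
        ]
    }
    print(f"{prompt_bpp(prompt, 640, 480):.5f} bpp")
```

In a full system the total rate would also include the compressed general features; this sketch accounts only for the text side.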
