

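1. College of Electronic Information Engineering, Taiyuan University of Technology, Jinzhong, Shanxi 030600
2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190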
Received: 05 June 2024
Revised: 30 October 2024
Published: 25 February 2025
WANG Nan-jing, LIU A-jian, LIANG Feng-mei, et al. Discriminative Category Prompt Learning Based on Image Content Understanding[J]. Acta Electronica Sinica, 2025, 53(02): 493-502. DOI:10.12263/DZXB.20240522
In recent years, methods based on contrastive language-image pre-training (CLIP) have used text information as classifier weights through the joint representation of images and text, and have shown excellent performance on general image recognition tasks. However, existing methods such as context optimization (CoOp) and conditional context optimization (CoCoOp) construct category text prompts in isolation, without considering the semantic content of the image or the importance of the category, which limits the model's ability to understand and discriminate image categories. To address these problems, this paper proposes a new CLIP-based method, discriminative category prompt learning based on image content understanding (DCPL), which learns text prompts from the rich content features in images and introduces manual templates to make the text prompts more discriminative across categories. Specifically, DCPL comprises a prompt generation (PG) module and a text supervision (TS) module. The PG module takes image features and initialized query vectors as inputs, and uses self-attention and cross-attention so that the output text prompts carry sufficient image semantic information. The TS module uses fixed category prompt templates as supervision to inject category information into the learnable text prompts at both the category level and the logits level, strengthening the importance of categories. Finally, on 11 public classification datasets, including ImageNet, Caltech101, and Oxford-Pets, DCPL achieves an average 16-shot accuracy of 81.84%, which is 0.98 percentage points higher than that of the previous best method, Cross-Modal.
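The data flow of the prompt generation module described in the abstract (initialized query vectors refined by self-attention, then by cross-attention over image features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `attention` and `generate_prompts`, the single-head attention without learned projections or normalization, the residual connections, and all sizes are assumptions chosen only to keep the example small and runnable with the Python standard library.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def add(xs, ys):
    """Element-wise residual sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def generate_prompts(query_vectors, image_features):
    """Hypothetical PG step (assumed layout, not taken from the paper):
    1) self-attention lets the query vectors exchange information;
    2) cross-attention lets each query read from the image features."""
    mixed = add(query_vectors,
                attention(query_vectors, query_vectors, query_vectors))
    return add(mixed, attention(mixed, image_features, image_features))


random.seed(0)
dim, num_queries, num_patches = 8, 4, 6  # illustrative sizes only
query_vectors = [[random.gauss(0, 0.02) for _ in range(dim)]
                 for _ in range(num_queries)]
image_features = [[random.gauss(0, 1) for _ in range(dim)]
                  for _ in range(num_patches)]

prompts = generate_prompts(query_vectors, image_features)
print(len(prompts), len(prompts[0]))
```

The output keeps one vector per query, each with the same width as the image features, so the result could stand in for a sequence of prompt token embeddings. A real model would add learned projections, multiple heads, and normalization, and would presumably pass the prompts through the CLIP text encoder together with the class name.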