Image Enhancement via Content Semantic-Aware Multimodal Fusion

ZHU Han-cheng; LIU Xin-yu; YAO Rui; SHAO Zhi-wen; ZHOU Yong; LI Lei-da

doi:10.12263/DZXB.20241088


PAPERS | Updated: 2025-12-10

- ACTA ELECTRONICA SINICA Vol. 53, Issue 7, Pages: 2252-2265 (2025)
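- Affiliations:
  
  1. School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology, Xuzhou, Jiangsu 221116
  2. Engineering Research Center of Mine Digitization, Ministry of Education, Xuzhou, Jiangsu 221116
  3. School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi 710126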
- Funding:
  
  National Natural Science Foundation of China(62101555;62172417;62472424;62272461;62106268)
- DOI: 10.12263/DZXB.20241088
  CLC: TP391.4
- Received: 03 December 2024
  
  Revised: 17 April 2025
  
  Published: 25 July 2025
ZHU Han-cheng, LIU Xin-yu, YAO Rui, et al. Image Enhancement via Content Semantic-Aware Multimodal Fusion[J]. Acta Electronica Sinica, 2025, 53(07): 2252-2265. DOI: 10.12263/DZXB.20241088


Abstract

Among image enhancement techniques, curve mapping-based retouching strategies have attracted significant research interest because they effectively retain the original content information of images. However, current curve-mapping methods primarily focus on the color-space mapping between images before and after enhancement, often neglecting the influence of image content on the enhancement results. This limitation leads to suboptimal adjustments for images with similar colors but different content, resulting in less refined and less natural enhancements. To address this issue, this paper proposes an image enhancement method based on content semantic-aware multimodal fusion, which supplements image features with text features that describe the semantic content of the image. By fusing features from the image and text modalities, the proposed approach captures content-aware multimodal semantics, enabling fine-grained adjustments tailored to different image content. First, a multimodal large language model is employed to generate textual descriptions of the image content, which are then used for multimodal prompt learning to guide the understanding of that content. This enables the model to leverage content-based text prompts for auxiliary image enhancement. Then, an attention mechanism is applied to integrate and fuse the textual and image features into a unified multimodal representation. Finally, this representation is used to construct a curve-mapping function, enabling content-specific image adjustments and enhancements. Experimental results on multiple public benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance, highlighting the effectiveness and advantages of incorporating content-aware semantic information into image enhancement tasks.
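The abstract states only that an attention mechanism fuses the text and image features. As a minimal sketch of one common way to do this (an assumption, not the authors' published module), image tokens can act as queries in a single-head cross-attention over the text tokens, with the attended text context added back to each image token. The helper names (`matmul`, `softmax`, `cross_attention_fuse`) and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical; in a trained model the projections would be learned, and plain Python lists stand in for tensors.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product: (n x k) @ (k x m) -> (n x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(img_tokens: Matrix, txt_tokens: Matrix,
                         w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Hypothetical single-head cross-attention: image tokens query text tokens.

    img_tokens: N_img x d_model, txt_tokens: N_txt x d_txt.
    w_q: d_model x d, w_k: d_txt x d, w_v: d_txt x d_model (so the residual add fits).
    Returns N_img x d_model fused (multimodal) tokens.
    """
    q = matmul(img_tokens, w_q)
    k = matmul(txt_tokens, w_k)
    v = matmul(txt_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Dot-product scores of every image token against every text token (q @ k^T).
    scores = matmul(q, [list(col) for col in zip(*k)])
    # Scale by 1/sqrt(d) and normalise each row into attention weights.
    attn = [softmax([s * scale for s in row]) for row in scores]
    # Each image token gathers a weighted mix of text value vectors.
    context = matmul(attn, v)
    # Residual connection keeps the original image information.
    return [[x + c for x, c in zip(xi, ci)] for xi, ci in zip(img_tokens, context)]


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    img = [[1.0, 0.0], [0.0, 1.0]]               # two toy image tokens
    txt = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three toy text tokens
    print(cross_attention_fuse(img, txt, eye, eye, eye))
```

With a single text token the softmax weight is exactly 1, so every image token receives that token's value vector. With several text tokens, each image token mixes text features in proportion to their similarity, which is the intended route by which regions with similar colors but different described content can end up with different features.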
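The abstract also leaves the curve parameterization unspecified. The sketch below assumes a per-channel, monotonic piecewise-linear curve with uniformly spaced knots, a common choice in curve-based retouching but not confirmed by the source. In this assumed setup the raw knot values would be regressed from the fused multimodal feature by a small prediction head, which is not shown; the function names (`build_monotonic_curve`, `apply_curve`, `retouch_pixel`) are hypothetical.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + e^x); always > 0."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def build_monotonic_curve(raw: Sequence[float]) -> List[float]:
    """Map K unconstrained predictions to K + 1 knot values on [0, 1].

    Positive increments, a running sum and normalisation give a curve that is
    increasing, starts at 0 and ends at 1.
    """
    inc = [softplus(r) for r in raw]
    total = sum(inc)
    knots, acc = [0.0], 0.0
    for step in inc:
        acc += step
        knots.append(acc / total)
    return knots


def apply_curve(x: float, knots: Sequence[float]) -> float:
    """Piecewise-linear lookup of intensity x in [0, 1] on uniformly spaced knots."""
    segments = len(knots) - 1
    x = min(max(x, 0.0), 1.0)
    pos = x * segments
    i = min(int(pos), segments - 1)   # index of the segment containing x
    t = pos - i                       # fractional position inside that segment
    return knots[i] + t * (knots[i + 1] - knots[i])


def retouch_pixel(rgb: Sequence[float], curves: Sequence[Sequence[float]]) -> List[float]:
    """Apply one curve per color channel (R, G, B) to a single pixel."""
    return [apply_curve(c, curve) for c, curve in zip(rgb, curves)]


if __name__ == "__main__":
    # Hypothetical head outputs: equal values give an identity curve,
    # decreasing values give a brightening (concave) curve.
    identity = build_monotonic_curve([0.0, 0.0, 0.0, 0.0])
    brighten = build_monotonic_curve([3.0, 0.0, -3.0])
    print(retouch_pixel([0.3, 0.5, 0.5], [identity, brighten, identity]))
```

Passing the raw values through softplus and normalizing their running sum guarantees curve(0) = 0, curve(1) = 1 and monotonicity, so the mapping never reorders intensities and only changes tone and color, in line with the content-preserving property the abstract attributes to curve-based methods.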


