Disentangling the Visual Concept Generation of Denoising Diffusion Probabilistic Model from a Game-Theoretic View

LIU Chao-yi; GENG Hao-bang; GE Ya-wei; LIN Han; HOU Na; ZHAO Er-hu; HUANG Li-bo; XU Yong-jun

doi:10.12263/DZXB.20250716

Research article | Last updated: 2026-02-10

- Disentangling the Visual Concept Generation of Denoising Diffusion Probabilistic Model from a Game-Theoretic View
- Acta Electronica Sinica, 2025, Vol. 53, No. 11, pp. 3910-3919
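- Author affiliations:
  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
  2. Academy of Military Sciences, Beijing 100091
  3. Unit 32801 of the Chinese People's Liberation Army, Beijing 100082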
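- Author biographies:
  - LIU Chao-yi, male, born in May 1998 in Jinan, Shandong Province. He is a Ph.D. candidate at the Institute of Computing Technology, Chinese Academy of Sciences. His main research interest is domain generalization. E-mail: liuchaoyi22@mails.ucas.ac.cn
  - GENG Hao-bang, male, born in October 2000 in Zhengzhou, Henan Province. He graduated from the Institute of Computing Technology, Chinese Academy of Sciences, in June 2024. His main research interest is diffusion-model-based visual generation. E-mail: haobang.geng@kunlun-inc.com
  - GE Ya-wei, male, born in October 1990 in Zaozhuang, Shandong Province. He is an assistant researcher at the Strategic Assessment and Consultation Center, Academy of Military Sciences. His main research interests are military assessment and operations research for decision-making. E-mail: vvrues11@163.com
  - LIN Han, male, born in February 1998 in Putian, Fujian Province. He is a Ph.D. candidate at the Strategic Assessment and Consultation Center, Academy of Military Sciences. His main research interests are military assessment and causal inference. E-mail: lh98cool@163.com
  - ZHAO Er-hu, male, born in September 1985 in Xingtai, Hebei Province. Ph.D., senior engineer, and master's supervisor at the Institute of Computing Technology, Chinese Academy of Sciences, where he leads the Intelligent Computing Platform Research Group of the Equipment Intelligent Systems Research Center. His main research interests are embedded intelligent computing systems, special-purpose computer systems, and chip microsystem architecture. E-mail: zhaoerhu@ict.ac.cn
  - HUANG Li-bo, male, born in July 1992 in Ji'an, Jiangxi Province. He is an assistant researcher at the Institute of Computing Technology, Chinese Academy of Sciences. His main research interests are machine learning and artificial intelligence. E-mail: huanglibo@ict.ac.cn
  - XU Yong-jun, male, born in July 1979 in Chengdu, Sichuan Province. He is a professor-level senior engineer, research professor, and doctoral supervisor at the Institute of Computing Technology, Chinese Academy of Sciences, where he serves as director of the institute's Special Technology Research Center and executive deputy director of the "Hua Luogeng" Innovation Center of the State Administration of Science, Technology and Industry for National Defense. His main research interests are artificial intelligence systems and big data processing. E-mail: xyj@ict.ac.cn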
- Funding:
  Beijing Natural Science Foundation (4244098)
- DOI: 10.12263/DZXB.20250716
- CLC number: TP18; TP391.41
- Received: 2025-08-15; accepted: 2025-11-13; published in print: 2025-11-25

LIU Chao-yi, GENG Hao-bang, GE Ya-wei, et al. Disentangling the Visual Concept Generation of Denoising Diffusion Probabilistic Model from a Game-Theoretic View[J]. Acta Electronica Sinica, 2025, 53(11): 3910-3919. DOI: 10.12263/DZXB.20250716

Abstract

Denoising diffusion probabilistic models (DDPMs), a core technology in current generative AI, have achieved revolutionary breakthroughs in high-quality image synthesis. However, their internal working mechanisms have long been regarded as a "black box", which severely restricts their large-scale application in high-trust scenarios such as medical imaging and autonomous driving. Existing research mostly analyzes the macroscopic behavior of the reverse denoising process and lacks a fine-grained deconstruction of the dynamic interactions among different semantic regions in the latent space, leaving a significant gap between model interpretability and precise controllability. This study explores the interpretability of DDPMs from the new perspective of disentangled visual concept generation. The findings not only explain, from a theoretical standpoint, how locality manifests in DDPMs, but also enable fine-grained image manipulation in downstream applications. Inspired by game theory, we propose using Shapley values to evaluate the interactions between regions. However, computing Shapley values directly from the traditional definition is infeasible in terms of time complexity. We therefore further propose a theorem and an accompanying sampling strategy that reduce the time complexity to a bound [formula available only as an image in the source] expressed in terms of the number of regions and the number of samples. Qualitative and quantitative experiments show that, when applied to real-image editing, our method improves local manipulation performance by 30%~55% over existing methods. In practical applications, users can modify specific visual concepts without interfering with other regions. Through the deep integration of game theory and DDPMs, this work theoretically clarifies, for the first time, the mathematical essence and implementation path of locality in diffusion models, and constructs in practice the first interpretable DDPM framework with semantic disentangling capability.
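
To make the complexity argument concrete, the following minimal Python sketch contrasts the traditional Shapley definition, which enumerates every coalition of regions, with a generic permutation-sampling estimator. It is not the theorem or sampling strategy proposed in the paper, whose details are not given in the abstract. The names `exact_shapley`, `sampled_shapley`, and `toy_value`, and the toy scoring rule itself, are assumptions introduced purely for illustration; in a real setting the value function would score a DDPM output generated with only a subset of regions active.

```python
import itertools
import math
import random
from typing import Callable, FrozenSet, List

# A value function maps a coalition of region indices to a scalar score.
ValueFn = Callable[[FrozenSet[int]], float]


def exact_shapley(n: int, value: ValueFn) -> List[float]:
    """Shapley values from the classical definition (exponential in n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k that does not contain player i.
            weight = (math.factorial(k) * math.factorial(n - k - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def sampled_shapley(n: int, value: ValueFn, m: int, seed: int = 0) -> List[float]:
    """Monte Carlo estimate averaged over m random orderings of the regions.

    Each ordering costs n + 1 value evaluations, so the total is m * (n + 1).
    """
    rng = random.Random(seed)
    phi = [0.0] * n
    order = list(range(n))
    for _ in range(m):
        rng.shuffle(order)
        coalition: FrozenSet[int] = frozenset()
        prev = value(coalition)
        for i in order:
            coalition = coalition | {i}
            cur = value(coalition)
            phi[i] += cur - prev  # marginal contribution of region i
            prev = cur
    return [p / m for p in phi]


def toy_value(regions: FrozenSet[int]) -> float:
    """Hypothetical score: additive region weights plus one pairwise interaction."""
    weights = [1.0, 2.0, 3.0, 4.0]
    score = sum(weights[r] for r in regions)
    if 0 in regions and 1 in regions:
        score += 2.0  # regions 0 and 1 interact; regions 2 and 3 are independent
    return score


if __name__ == "__main__":
    print("exact  :", exact_shapley(4, toy_value))
    print("sampled:", sampled_shapley(4, toy_value, m=200))
```

In this toy game the interaction bonus between regions 0 and 1 is split evenly between them, so the exact values are close to 2, 3, 3, and 4, while the independent regions 2 and 3 receive exactly their own weights under any ordering. The enumeration visits every subset of the other regions for each region, whereas the sampled estimator's cost grows with the product of the number of regions and the number of sampled orderings.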
