

College of Command and Control Engineering, Army Engineering University of PLA, Nanjing, Jiangsu 210007, China
Received: 21 April 2025
Accepted: 16 September 2025
Published: 25 September 2025
LI Hao, HAO Wen-ning, ZOU Shi-chen, et al. Progressive Image Synthesis Method Based on Diffusion-Mamba and Scale-Invariant Loss[J]. Acta Electronica Sinica, 2025, 53(09): 3384-3396. DOI:10.12263/DZXB.20250308
Diffusion models have attracted wide attention in image generation because of their high precision, and their backbone networks have evolved from U-Net to Transformer. However, the computational cost of a Transformer grows quadratically with sequence length, which makes diffusion models expensive when processing high-resolution images. To address this problem, we propose a progressive image synthesis method based on Diffusion-Mamba and a scale-invariant loss. The method uses a multi-directional scanning mechanism and a lightweight local structure enhancement module to combine the efficiency of Mamba with the modeling capability of diffusion models, and it converts low-resolution images into high-resolution images through a progressive cascaded diffusion process, which reduces computational complexity. In addition, we design a contrastive-learning-based scale-invariant loss function that maximizes the mutual information of the same target across different resolutions, thereby aligning and enhancing cross-scale feature representations. Experiments on the ImageNet dataset, where the method reaches an FID of 1.67, show an overall improvement in accuracy and support the effectiveness and efficiency of the method.
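The abstract does not state the formula of the scale-invariant loss. As a rough illustration only, the sketch below assumes a symmetric InfoNCE objective, a common contrastive loss for which `log(N) - loss` is a lower bound on the mutual information between the two views. The function name `scale_invariant_info_nce`, the `temperature` value, and the feature shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def scale_invariant_info_nce(z_low, z_high, temperature=0.1):
    """Symmetric InfoNCE between features of the same N images at two resolutions.

    z_low, z_high: arrays of shape (N, D); row i of each array is assumed to
    describe the same image at a low and a high resolution (hypothetical setup).
    """
    # L2-normalise so that dot products are cosine similarities.
    z_low = z_low / np.linalg.norm(z_low, axis=1, keepdims=True)
    z_high = z_high / np.linalg.norm(z_high, axis=1, keepdims=True)

    # Pairwise similarity matrix; entry (i, j) compares image i (low) with j (high).
    logits = z_low @ z_high.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The positive pair for row i sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    # Average the low-to-high and high-to-low directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


# Toy check: matched cross-scale features should score better than mismatched ones.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
aligned = scale_invariant_info_nce(features, features + 0.01 * rng.normal(size=(8, 16)))
misaligned = scale_invariant_info_nce(features, np.roll(features, 1, axis=0))
print(aligned < misaligned)
```

Minimizing a loss of this kind pulls together the representations of one image at different resolutions and pushes apart those of different images, which is one plausible way to obtain the cross-scale alignment the abstract describes.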