Speaker-Independent Visual Dubbing Method Based on Semantic Enhancement and Texture-Motion Fusion

CHEN Yi-lei; XIONG Sheng-wu

doi:10.12263/DZXB.20240685


Research article | Updated: 2026-02-05

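- Acta Electronica Sinica, 2025, Vol. 53, No. 10, pp. 3608-3621
- Author affiliations:
  
  1. Hubei Key Laboratory of Digital Finance Innovation, Hubei University of Economics, Wuhan, Hubei 430205
  2. School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, Hubei 430070
  3. School of Information Engineering, Hubei University of Economics, Wuhan, Hubei 430205
- Author biographies:
  
  CHEN Yi-lei, male, was born in March 1992 in Wuhan, Hubei Province. He is a researcher at the Hubei Key Laboratory of Digital Finance Innovation, Hubei University of Economics. His main research interests are computer vision and few-shot face video generation.
  XIONG Sheng-wu, male, was born in November 1966 in Xianning, Hubei Province. He is a professor at the School of Computer Science and Artificial Intelligence, Wuhan University of Technology, and at the Institute of Interdisciplinary Artificial Intelligence, Wuhan College. His main research interests are intelligent computing, machine learning, and pattern recognition. E-mail: xiongsw@whut.edu.cn
- Funding:
  
  National Key Research and Development Program of China (2022ZD0160604)
- CLC number: TP37
- Received: 2024-07-22; accepted: 2025-10-23; published in print: 2025-10-25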
CHEN Yi-lei, XIONG Sheng-wu. Speaker-Independent Visual Dubbing Method Based on Semantic Enhancement and Texture-Motion Fusion[J]. Acta Electronica Sinica, 2025, 53(10): 3608-3621. DOI: 10.12263/DZXB.20240685

Abstract

Speaker-independent visual dubbing edits the lip region of a talking-face video so that its motion follows a given speech signal, with the goal of tight audio-visual synchronization and a natural-looking result. The task requires accurate lip sync and, at the same time, facial texture and identity that stay consistent with the original video. When the video contains natural head motion, however, existing methods often produce an inpainted region whose texture does not match the surrounding real face, which degrades generation quality.

To address this problem, this paper proposes a motion-texture co-generation network with cross-modal semantic enhancement and 3D face guidance. The method uses a 3D morphable model (3DMM) as the intermediate representation and decomposes the task into two subtasks: speech-driven prediction of 3D expression coefficients, and face rendering in which motion and texture are handled jointly. In the first stage, a cross-modal semantic-enhanced 3DMM expression coefficient prediction network combines Wav2Lip-generated semantic image sequences with audio features through a local cross-modal attention mechanism, which significantly improves audio-visual synchronization and geometric consistency. In the second stage, a 3D face-guided motion-texture rendering network uses multiple reference faces and the reconstructed 3D face for texture compensation and detail enhancement, and a multi-task learning framework keeps the texture of the inpainted region consistent with the real face.

Extensive experiments on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed method outperforms representative existing methods in generation fidelity, motion robustness, and synchronization. Compared with the baseline on VoxCeleb1, the method improves peak signal-to-noise ratio (PSNR) by 7.76, reduces learned perceptual image patch similarity (LPIPS) by 0.08, increases the structural similarity index measure (SSIM) by 0.11, decreases landmark distance (LMD) by 1.10, and improves the lip-sync score (Sync) by 0.20. On VoxCeleb2, it improves PSNR by 7.12, reduces LPIPS by 0.10, increases SSIM by 0.11, decreases LMD by 1.10, and improves Sync by 0.15. These results verify the effectiveness of the proposed method under complex head motion and diverse identities.
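To give a feel for the size of the reported PSNR gains, the following is a minimal sketch of the standard PSNR definition and of what a gain of a given size implies for mean squared error. It is not the authors' evaluation code. It assumes that PSNR is reported in dB (the abstract does not state the unit) and that pixels are 8-bit values with a peak of 255; the function names `mse`, `psnr`, and `mse_ratio_for_gain` and the toy pixel values are hypothetical and chosen only for illustration.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, peak=255.0):
    """PSNR in dB. `peak` is the largest possible pixel value (assumed 8-bit)."""
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / err)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by `gain_db` (same peak)."""
    return 10.0 ** (gain_db / 10.0)


# Toy data, not taken from the paper.
reference = [10, 20, 30, 40]
generated = [12, 18, 33, 40]

print(f"PSNR of toy example: {psnr(reference, generated):.2f} dB")
print(f"MSE reduction implied by a 7.76 dB gain: {mse_ratio_for_gain(7.76):.2f}x")
print(f"MSE reduction implied by a 7.12 dB gain: {mse_ratio_for_gain(7.12):.2f}x")
```

Because PSNR is logarithmic in the error, a fixed dB gain corresponds to a fixed multiplicative reduction in mean squared error, independent of the starting PSNR. Under the dB assumption above, the reported gains would correspond to roughly a five- to six-fold reduction in per-pixel squared error relative to the baseline.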


