Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches

SU Zhao-pin; ZHOU Xiao-lin; ZHANG Guo-fu; LIAN Chen-si; WANG Nian-song; YUE Feng

doi:10.12263/DZXB.20240819


Large Models and the Internet | Updated: 2025-10-16

- Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches
- Acta Electronica Sinica, 2025, Vol. 53, No. 6, pp. 1815-1828
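- Author affiliations:
  1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230601
  2. Physical Evidence Identification Management Office, Anhui Provincial Public Security Department, Hefei, Anhui 230000
  3. Anhui Province Laboratory of Intelligent Interconnected Systems (Hefei University of Technology), Hefei, Anhui 230009
  4. Joint Laboratory for Intelligent Audio and Video Anti-Forgery and Recognition, Hefei, Anhui 230000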
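- Author biographies:
  SU Zhao-pin, female, born in August 1983 in Heze, Shandong. Associate professor, master's supervisor, CCF member. She received her bachelor's and doctoral degrees from Hefei University of Technology in 2004 and 2008, respectively. Her main research interests are audio information hiding, deep learning, and evolutionary computation. Chinese Institute of Electronics membership No. E190027825M. E-mail: szp@hfut.edu.cn
  ZHOU Xiao-lin, male, born in October 1999 in Bengbu, Anhui. Master's student. He received his bachelor's degree from Huaibei Normal University in 2022. His main research interest is key technologies for tracing the source of converted speech. E-mail: 2022171228@mail.hfut.edu.cn
  ZHANG Guo-fu, male, born in March 1979 in Hefei, Anhui. Professor, master's supervisor, member of CCF and CAA. He received his bachelor's and doctoral degrees from Hefei University of Technology in 2002 and 2008, respectively. He is currently deputy director of the Anhui Province Key Laboratory of Industry Safety and Emergency Technology. His main research interests include search-based software engineering, audio security, and evolutionary computation. E-mail: zgf@hfut.edu.cn
  LIAN Chen-si, female, master's degree, senior engineer. Her main research interest is voiceprint identification. E-mail: lchsi324@163.com
  WANG Nian-song, male, graduated in 2002 from the Criminal Investigation Police University of China, majoring in public security imaging; holds a full senior professional title. His main research interest is multimedia forensics. E-mail: 28640145@qq.com
  YUE Feng, born in February 1981 in Hefei, Anhui. Associate researcher, master's supervisor. He received his bachelor's, master's, and doctoral degrees from Hefei University of Technology in 2004, 2009, and 2015, respectively. His main research interests are software engineering, audio information hiding, and evolutionary computation. E-mail: yuefeng@huft.edu.cn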
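- Funding:
  Humanities and Social Sciences Research Planning Fund of the Ministry of Education (No. 24YJA870011); Anhui Provincial Key Research and Development Program (No. 202104d07020001)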
- DOI: 10.12263/DZXB.20240819
  CLC number: TP301
- Received: 2024-09-03; revised: 2025-06-05; published in print: 2025-06-25

SU Zhao-pin, ZHOU Xiao-lin, ZHANG Guo-fu, et al. Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches[J]. Acta Electronica Sinica, 2025, 53(06): 1815-1828. DOI: 10.12263/DZXB.20240819

Abstract

Voice conversion (VC) is an artificial intelligence technology that uses deep learning to convert the voice of a source speaker into that of a target speaker. It is widely used in movie dubbing and personalized voice customization, but it is also exploited by malicious actors for telecom fraud, identity forgery, and political and social manipulation, posing serious threats to personal privacy, social stability, and even national security. Compared with detecting VC-generated speech, restoring the source speaker's voice from VC-generated speech, that is, VC-generated speech restoration, has greater research significance and practical value for tracing the real speaker and preventing the illegal use of VC. However, few related studies exist. This paper therefore proposes a restoration method for VC-generated speech based on adversarial learning and enhanced optimization. Specifically, the similarity of VC-generated speech to the source and target speech is first analyzed, and a restoration framework based on preliminary restoration followed by enhanced optimization is proposed. Second, an adversarial restoration network is designed based on dynamic convolution and attention mechanisms; through adversarial learning among a generator, a classifier, and a discriminator, it learns as much source speaker information as possible from the converted speech. Third, an enhanced optimization network consisting of a timbre extractor, a content extractor, and a vocoder is designed to deeply fuse the timbre information in the preliminarily restored speech with the content information in the VC-generated speech, producing the optimized restored speech. Finally, the effectiveness of the proposed method is validated on datasets from three high-performance VC models: FreeVC, TriAAN-VC, and BNE-PPG-VC. Comparative experiments show that, for these three VC models respectively, the proposed method improves the mean cosine similarity between the restored speech and the real speech by 11.9, 8.7, and 7.1 percentage points, and reduces the mean equal error rate (EER) of the speaker verification system by 4.30, 3.40, and 3.98 percentage points. These results indicate that the proposed method can not only effectively recover the source speaker's speech but also has some applicability to unknown VC-generated speech.
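The two evaluation metrics named in the abstract, cosine similarity and equal error rate, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the abstract does not state which speaker encoder or thresholding procedure was used, so the function names `cosine_similarity` and `equal_error_rate`, the plain-list embeddings, and the threshold sweep over observed scores are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score >= threshold (assumed convention).
    FRR is the fraction of genuine trials rejected; FAR is the fraction of
    impostor trials accepted. The function returns the mean of FAR and FRR
    at the threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

In such a setup, the similarity scores would typically come from comparing speaker embeddings of the restored speech against embeddings of the real source speech. A higher mean cosine similarity and a lower EER would then both suggest that the restored speech is closer to the source speaker.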


