Any-to-Any Voice Conversion Using Double Exchange Representation Separation

ZHANG Zi-xu; JIAN Zhi-hua

doi:10.12263/DZXB.20230246


PAPERS | Updated: 2025-12-11

- ACTA ELECTRONICA SINICA Vol. 52, Issue 6, Pages: 2141-2150 (2024)
- Author affiliation: School of Communication Engineering, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China
- Funding: National Natural Science Foundation of China (61201301; 61772166)
- DOI: 10.12263/DZXB.20230246
- CLC: TP391
- Received: 17 March 2023; Revised: 16 June 2023; Published: 25 June 2024
ZHANG Zi-xu, JIAN Zhi-hua. Any-to-Any Voice Conversion Using Double Exchange Representation Separation[J]. Acta Electronica Sinica, 2024, 52(06): 2141-2150. DOI: 10.12263/DZXB.20230246

Abstract

In any-to-any voice conversion, the training phase typically uses an encoder to disentangle the speech of a single speaker and a decoder to reconstruct it, whereas in the conversion phase the decoder must combine the content information of the source speech with the speaker characteristics of the target speech. The decoder therefore faces a mismatch between the training and conversion phases, which degrades voice conversion performance. This paper proposes DERS-VC (Double Exchange Representation Separation Voice Conversion), a voice conversion method based on double exchange representation separation. During self-reconstruction in the training phase, the method uses speech from the same speaker to simulate speech from different target speakers for self-supervised training. A conversion invariance loss and a cycle consistency loss are introduced, and the cyclic process of double exchange representation separation brings the self-reconstructed speech closer to the original speech. Experimental results show that, compared with AGAIN-VC (Activation Guidance and Adaptive Instance Normalization Voice Conversion), DERS-VC reduces MCD (Mel-Cepstral Distortion) by 4.03% on average and increases MOS (Mean Opinion Score) by 3.62%, improving both the quality and the similarity of the converted speech. This shows that double exchange representation separation can reduce the decoder mismatch and improve the performance of any-to-any voice conversion.
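For readers unfamiliar with the MCD metric reported above, the following is a minimal sketch of the commonly used definition, in which the per-frame distortion in dB is (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over frames. The abstract does not state the paper's exact evaluation setup, so everything here is an assumption: the function name `mel_cepstral_distortion` is hypothetical, the two sequences are assumed to be already time-aligned (for example by dynamic time warping), and the 0th (energy) coefficient is assumed to have been removed beforehand.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean mel-cepstral distortion (dB) between two aligned sequences.

    Assumptions (not taken from the paper): each frame is a list of
    mel-cepstral coefficients with the 0th coefficient already excluded,
    and both sequences have been time-aligned to the same length.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and time-aligned")

    # Constant factor of the standard definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(squared_error)

    # Average the per-frame distortion over all frames.
    return total / len(ref_frames)
```

A lower value means the converted speech is spectrally closer to the reference, which is why a reduction in MCD is read as an improvement.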
