Acta Electronica Sinica ›› 2016, Vol. 44 ›› Issue (9): 2282-2288. DOI: 10.3969/j.issn.0372-2112.2016.09.037

• Research Communications •

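Structured Gaussian Mixture Model Under Constraint Conditions and Voice Conversion with Non-parallel Corpora

CHE Ying-xia (车滢霞), YU Yi-biao (俞一彪)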

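  1. School of Electronic and Information Engineering, Soochow University, Suzhou, Jiangsu 215006
  • Received: 2015-02-08 Revised: 2015-08-10 Published: 2016-09-25
    • Corresponding author:
    • YU Yi-biao
    • About the author:
    • CHE Ying-xia, female, born in 1989 in Changzhou, Jiangsu; master's degree, School of Electronic and Information Engineering, Soochow University; research interest: speech signal processing.
    • Funding:
    • National Natural Science Foundation of China (No. 61271360); Natural Science Foundation of Jiangsu Province (No. BK20131196)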

Non-parallel Corpora Voice Conversion Based on Structured Gaussian Mixture Model Under Constraint Conditions

CHE Ying-xia, YU Yi-biao   

  1. School of Electronic and Information Engineering, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2015-02-08 Revised: 2015-08-10 Online: 2016-09-25 Published: 2016-09-25
    • Supported by:
    • National Natural Science Foundation of China (No. 61271360); Natural Science Foundation of Jiangsu Province, China (No. BK20131196)

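Abstract (translated from the Chinese):

This paper proposes a structured Gaussian mixture model under constraint conditions and a corresponding voice conversion method for non-parallel corpora. A small number of identical syllables are extracted from the original non-parallel corpora of the source and target speakers. During training of the structured Gaussian mixture model, the semantic information carried by these syllables and the correspondence between their acoustic features are used to constrain the K-means cluster centers, and the posterior probability that a speech frame belongs to a model component is corrected during the Expectation-Maximization (EM) iterations. The result is a constrained structured Gaussian mixture model (Structured Gaussian Mixture Model with Constraint condition, C-SGMM). The Gaussian distributions of the source and target speakers' C-SGMMs are then matched and aligned using the Acoustic Universal Structure (AUS) principle, from which a short-time spectral conversion function is derived. Subjective and objective evaluations show that speech converted with this method outperforms the traditional structured-model voice conversion method in spectral distortion, target tendency, and speech quality: the average spectral distortion of the converted speech is only 0.52, the speaker recognition rate reaches 95.25%, and the ABX target-tendency score averages 0.82, bringing performance closer to that of voice conversion based on parallel corpora.

Keywords (translated from the Chinese): voice conversion, structured Gaussian mixture model, non-parallel corpora, constraint conditions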

Abstract:

This paper proposes a structured Gaussian mixture model with constraint conditions (C-SGMM) for voice conversion with non-parallel corpora. A small number of speech segments containing the same syllables are extracted from the source and target non-parallel corpora as constraint conditions, and the correspondence between source and target acoustic features established by these syllables is applied during statistical acoustic model training. The constraint conditions restrict the cluster centers in K-means clustering, and they are also used in the EM algorithm to adjust the posterior probability that a speech frame belongs to a Gaussian component. The Gaussian distributions of the source and target structured Gaussian mixture models are then aligned using the acoustic universal structure principle, from which the conversion function is derived. Subjective and objective experiments indicate that the proposed method outperforms the traditional structured method in cepstrum distortion, target tendency, and speech quality. The average cepstrum distortion of the converted speech is only 0.52, the speaker recognition rate of the converted speech reaches 95.25%, and the performance comes closer to that of the conventional GMM-based method trained on parallel corpora.
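The abstract does not state the exact rule used to adjust posteriors during EM, so the following Python sketch is only an illustration under stated assumptions. It computes the standard E-step posteriors of a diagonal-covariance GMM and then blends the posteriors of constrained frames toward a designated component. The function names, the `constraints` mapping (frame index to component index, as might be obtained from shared syllables), and the blending weight `alpha` are all hypothetical and are not taken from the paper.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def e_step(frames, weights, means, variances):
    """Standard EM E-step: posterior of each component for each frame."""
    posteriors = []
    for x in frames:
        logs = [
            math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        top = max(logs)  # subtract the maximum for numerical stability
        exps = [math.exp(value - top) for value in logs]
        total = sum(exps)
        posteriors.append([e / total for e in exps])
    return posteriors


def apply_constraints(posteriors, constraints, alpha=0.8):
    """Assumed correction rule (not from the paper).

    Each constrained frame t has its posterior blended toward component
    constraints[t]; alpha = 1.0 would be a hard assignment. Rows still
    sum to one because (1 - alpha) * 1 + alpha = 1.
    """
    corrected = [row[:] for row in posteriors]
    for t, k in constraints.items():
        corrected[t] = [
            (1.0 - alpha) * p + (alpha if j == k else 0.0)
            for j, p in enumerate(posteriors[t])
        ]
    return corrected


# Toy usage: three 1-D frames, two components, frame 2 tied to component 1.
frames = [[0.1], [2.9], [1.4]]
weights = [0.5, 0.5]
means = [[0.0], [3.0]]
variances = [[1.0], [1.0]]
post = e_step(frames, weights, means, variances)
corrected = apply_constraints(post, {2: 1}, alpha=0.8)
```

In this toy setup the unconstrained posterior of frame 2 slightly favors component 0, while the corrected posterior favors component 1; unconstrained frames are left unchanged. A soft blend is used here so that the acoustic likelihood still contributes; whether the paper uses a soft or hard correction is not stated in the abstract.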

Key words: voice conversion, structured Gaussian mixture model, non-parallel corpora, constraint conditions

CLC number: