Identity Vector Extraction Using Shared Mixture of PLDA for Short-Time Speaker Recognition
WANG Wenchao1,2, XU Ji1,2, YAN Yonghong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumchi 830011, China
Abstract: The performance of state-of-the-art speaker recognition systems degrades rapidly on short utterances. It is well known that identity vectors (i-vectors) extracted from short utterances carry large uncertainties, and the standard probabilistic linear discriminant analysis (PLDA) method cannot exploit this uncertainty to reduce the effect of duration variation. In this work, we use a shared mixture of PLDA (SM-PLDA) to remodel the i-vectors by exploiting their uncertainties. SM-PLDA is an improved generative model with a shared intrinsic factor; this factor can be regarded as an identity vector carrying speaker identification information and can in turn be modeled by PLDA. Performance is evaluated with both the equal error rate (EER) and the minimum detection cost function (minDCF). Results on the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2010 extended tasks show that the proposed method achieves significant improvements over the standard i-vector/PLDA system and several other advanced methods.
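For orientation, the following is a minimal sketch of the kind of model involved, written in standard simplified Gaussian PLDA form; the shared-mixture form below it only illustrates the general idea of tying a single intrinsic (speaker) factor across mixture components, and the exact parameterization of SM-PLDA follows the definitions given in the paper itself.

% Simplified Gaussian PLDA for the r-th i-vector of speaker s:
%   \beta_s : intrinsic (speaker) factor, shared by all sessions of speaker s
%   \Phi    : speaker subspace,  \Sigma : residual covariance
\begin{align*}
\mathbf{w}_{s,r} &= \boldsymbol{\mu} + \mathbf{\Phi}\,\boldsymbol{\beta}_s + \boldsymbol{\varepsilon}_{s,r},
  & \boldsymbol{\beta}_s \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\quad
    \boldsymbol{\varepsilon}_{s,r} \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}).
\end{align*}
% Shared-mixture sketch: each session picks a component k with prior \pi_k,
% while every component is tied to the same intrinsic factor \beta_s:
\begin{align*}
\mathbf{w}_{s,r} &= \boldsymbol{\mu}_{k} + \mathbf{\Phi}_{k}\,\boldsymbol{\beta}_s + \boldsymbol{\varepsilon}_{s,r},
  & k \sim \mathrm{Discrete}(\pi_1,\dots,\pi_K).
\end{align*}

In this reading, the shared factor \beta_s plays the role of the identity vector mentioned in the abstract and can then be scored with an ordinary PLDA back-end.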
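Since results are reported as EER and minDCF, the sketch below shows one common way to compute both metrics from pooled target and non-target trial scores; the cost parameters (P_target = 0.001, C_miss = C_fa = 1, i.e. the NIST SRE 2010 "new DCF" operating point) are an assumption for illustration and are not taken from the paper.

import numpy as np

def detection_error_tradeoff(target_scores, nontarget_scores):
    """Miss and false-alarm rates with the threshold swept over every observed score."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # Threshold placed just above the i-th sorted score:
    p_miss = np.cumsum(labels) / len(target_scores)                  # targets rejected
    p_fa = 1.0 - np.cumsum(1.0 - labels) / len(nontarget_scores)     # non-targets accepted
    return p_miss, p_fa

def equal_error_rate(target_scores, nontarget_scores):
    p_miss, p_fa = detection_error_tradeoff(target_scores, nontarget_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])

def min_dcf(target_scores, nontarget_scores, p_target=0.001, c_miss=1.0, c_fa=1.0):
    p_miss, p_fa = detection_error_tradeoff(target_scores, nontarget_scores)
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Normalize by the cost of the best trivial (accept-all or reject-all) system.
    return dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))

The threshold sweep simply visits every observed score, which is sufficient for evaluating both metrics on pooled trial lists of the size used in the SRE tasks.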
Funding: This work is partially supported by the National Natural Science Foundation of China (No.11590770-4, No.U1536117, No.11504406, No.11461141004), the National Key Research and Development Plan (No.2016YFB0801203, No.2016YFB0801200), the Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (No.2016A03007-1), and the Pre-research Project for Equipment of General Information System (No.JZX2017-0994/Y306).
About the author: WANG Wenchao was born in 1991. He received the B.E. degree in information engineering from Xidian University, China, in 2014. He is a Ph.D. candidate at the Institute of Acoustics, Chinese Academy of Sciences. His research interests include machine learning and robust speaker recognition. (Email: wangwenchao@hccl.ioa.ac.cn)
Cite this article:
WANG Wenchao, XU Ji, YAN Yonghong. Identity Vector Extraction Using Shared Mixture of PLDA for Short-Time Speaker Recognition. Chinese Journal of Electronics, 2019, 28(2): 357-363.