A Discriminative Segmental Feature Transform Method Based on Uncorrelated Matching Pursuit
CHEN Bin1,2, NIU Tong1, ZHANG Lian-hai1, QU Dan1, LI Bi-cheng1
1. Institute of Information System Engineering, Information Engineering University, Zhengzhou, Henan 450001, China;
2. Shanghai Branch of Southwest Electronics and Telecommunication Technology Research Institute, Shanghai 200434, China
A discriminative segmental feature transform method is proposed to improve the stability of the conventional frame-based approach. The feature transform is cast as a sparse approximation problem in a high-dimensional space. First, a set of feature transform matrices is estimated by tied-state training of RDLT (Region Dependent Linear Transform) and m-fMPE (mean-offset feature Minimum Phone Error), and the matrices are assembled into an over-complete dictionary. Then, the speech signal is segmented by forced alignment. Finally, following matching pursuit to iteratively optimize a likelihood objective function, the transform matrices for each segment are selected from the dictionary, and the corresponding coefficients are determined automatically during the optimization. Furthermore, to guarantee the stability of the selected transforms, a correlation measure is introduced to remove correlated bases during the recursion. Experimental results show that, compared with the traditional RDLT method, the recognition performance improves by 1.63% and 2.23% when the acoustic model is trained under the maximum likelihood criterion and the discriminative training criterion, respectively. The method can also be applied to speech enhancement and discriminative model training.
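The selection procedure described above can be illustrated with a minimal sketch of matching pursuit augmented with a correlation test. This is a generic illustration, not the paper's implementation: a squared-error residual stands in for the likelihood objective, and the dictionary atoms here are plain vectors rather than vectorized transform matrices; the function name, the `max_corr` threshold, and the test data are all assumptions.

```python
import numpy as np

def uncorrelated_matching_pursuit(D, y, n_atoms=3, max_corr=0.9):
    """Greedy sparse approximation of y over dictionary D (atoms as columns).

    Atoms whose correlation with an already-selected atom exceeds
    max_corr are skipped, mirroring the paper's removal of correlated
    bases to keep the selected basis stable. The coefficients of the
    selected atoms are refit by least squares at every step, so they
    fall out of the optimization automatically.
    """
    D = D / np.linalg.norm(D, axis=0)          # unit-norm atoms
    residual = y.astype(float).copy()
    selected, coeffs = [], np.zeros(0)
    for _ in range(n_atoms):
        scores = np.abs(D.T @ residual)
        scores[selected] = -np.inf             # never reselect an atom
        for j in selected:                     # correlation screening
            scores[np.abs(D.T @ D[:, j]) > max_corr] = -np.inf
        k = int(np.argmax(scores))
        if scores[k] == -np.inf:               # nothing admissible left
            break
        selected.append(k)
        # refit coefficients of all selected atoms, then update residual
        coeffs, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
        residual = y - D[:, selected] @ coeffs
    return selected, coeffs, residual
```

In the paper's setting, each "atom" would be a candidate transform matrix from the RDLT/m-fMPE dictionary and the score would come from the likelihood objective; the greedy select-screen-refit loop is the same.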