Abstract: In time-frequency-domain speech enhancement, both amplitude estimation and phase estimation are important factors affecting enhancement performance. To incorporate phase estimation into deep-learning-based speech enhancement, this paper treats the real and imaginary parts of the short-time Fourier transform (STFT) of noisy speech as two channels and feeds them into a deep convolutional neural network (DCNN). By building a multi-task learning model that simultaneously estimates the real and imaginary parts of the STFT of clean speech, joint estimation of amplitude and phase is achieved. Experimental results show that, compared with approaches that consider only amplitude estimation, the proposed approach suppresses noise more effectively and significantly improves speech enhancement performance under low-SNR conditions.
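As a rough illustration of the pipeline described in the abstract, the sketch below stacks the real and imaginary STFT parts of noisy speech as two input channels and regresses the corresponding clean-speech channels with a small convolutional network trained under a two-term MSE loss. The network depth, kernel sizes, STFT parameters, and the use of PyTorch are illustrative assumptions for this sketch, not the configuration used in the paper.

```python
# Minimal sketch (not the authors' exact network) of the idea in the abstract:
# real and imaginary parts of the noisy STFT form two input channels, and a DCNN
# jointly regresses the real and imaginary parts of the clean STFT (multi-task MSE).
import torch
import torch.nn as nn

class ComplexSpectrogramDCNN(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, channels, kernel_size=3, padding=1),   # 2 input channels: Re, Im
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),   # 2 output channels: Re, Im
        )

    def forward(self, x):                     # x: (batch, 2, freq, time)
        return self.body(x)

def stft_real_imag(wave, n_fft=512, hop=128):
    """Return a (2, freq, time) tensor holding the real and imaginary STFT parts."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.stack([spec.real, spec.imag], dim=0)

def train_step(model, optimizer, noisy_wave, clean_wave):
    """One multi-task training step: one MSE term per output channel, so amplitude
    and phase are both covered implicitly through the Re/Im targets."""
    noisy = stft_real_imag(noisy_wave).unsqueeze(0)   # (1, 2, F, T)
    clean = stft_real_imag(clean_wave).unsqueeze(0)
    est = model(noisy)
    loss = nn.functional.mse_loss(est[:, 0], clean[:, 0]) \
         + nn.functional.mse_loss(est[:, 1], clean[:, 1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the model could be trained with, for example, `torch.optim.Adam(model.parameters(), lr=1e-3)` over pairs of noisy and clean waveforms; at inference time the estimated real and imaginary channels would be recombined into a complex spectrogram and inverted with the inverse STFT to obtain the enhanced waveform.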