

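School of Measurement-Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin, Heilongjiang 150080, China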
Received: 07 June 2021
Revised: 30 January 2022
Published: 25 April 2023
LAN Chao-feng, WANG Shun-bo, GUO Xiao-xia, et al. A Single Channel Audio-Visual Fusion Speech Separation Method Based on DCNN and BiLSTM[J]. Acta Electronica Sinica, 2023, 51(04): 914-921. DOI: 10.12263/DZXB.20210726.
In recent years, with the rapid development of speech processing and computer technology, human-computer speech interaction has become increasingly important. Speech separation, the task of extracting target speech from mixed speech, is a key part of it. However, separation in complex open environments such as the well-known "cocktail party" scenario is still far from satisfactory. Targeting real-life multi-speaker conversation scenarios, this paper proposes an audio-visual fusion speech separation model, DCNN-BiLSTM, built on a dilated convolutional neural network (DCNN) and a bi-directional long short-term memory (BiLSTM) network. During training, the model uses the audio ID to look up the corresponding visual information, which focuses the audio on the matching speaker in the scene and thereby enhances separation. Experiments are carried out on the AVSpeech dataset, and separation quality is evaluated with PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and SDR (Signal-to-Distortion Ratio). The results show that the proposed method improves speech separation by 3.37 dB over the classic AVSpeech separation method (the English abstract in the source gives this figure as 3.73 dB and describes the baseline as the traditional speech separation method).
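Because the headline result is stated in decibels, a minimal sketch of how an SDR figure can be computed may help. It assumes the simplest energy-ratio definition, SDR = 10·log10(‖s‖² / ‖s − ŝ‖²); the abstract does not say which SDR variant the authors used, so this may differ from their evaluation code. The function names `sdr_db` and `sdr_improvement_db` and the toy signals are illustrative assumptions, not taken from the paper.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB, using the plain energy-ratio form
    10 * log10(||s||^2 / ||s - s_hat||^2). This is an assumed, simplified
    definition; toolkits such as BSS Eval use a more elaborate decomposition."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    distortion_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if distortion_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / distortion_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """Gain in SDR (dB) of the separated estimate over the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy example (hypothetical values): a target signal, an interfering speaker,
# their mixture, and an imperfect separated estimate of the target.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [0.5, 0.5, -0.5, -0.5]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [0.9 * t for t in target]

print("SDR of mixture  (dB):", round(sdr_db(target, mixture), 2))
print("SDR of estimate (dB):", round(sdr_db(target, estimate), 2))
print("SDR improvement (dB):", round(sdr_improvement_db(target, mixture, estimate), 2))
```

On this scale, each additional 3 dB of SDR corresponds to roughly halving the distortion energy relative to the target, which is one way to read a gain of a few decibels.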