A Single Channel Audio-Visual Fusion Speech Separation Method Based on DCNN and BiLSTM

LAN Chao-feng; WANG Shun-bo; GUO Xiao-xia; HAN Yu-lan; KANG Shou-qiang

doi:10.12263/DZXB.20210726

PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 51, Issue 4, Pages: 914-921(2023)
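- Author affiliation: School of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin, Heilongjiang 150080
- DOI: 10.12263/DZXB.20210726
- CLC: TP391.9
- Received: 07 June 2021
- Revised: 30 January 2022
- Published: 25 April 2023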
兰朝凤,王顺博,郭小霞等.基于DCNN和BiLSTM的单通道视听融合语音分离方法研究[J].电子学报,2023,51(04):914-921.

LAN Chao-feng, WANG Shun-bo, GUO Xiao-xia, et al. A Single Channel Audio-Visual Fusion Speech Separation Method Based on DCNN and BiLSTM[J]. ACTA ELECTRONICA SINICA, 2023, 51(04): 914-921.

摘要

近年来，随着语音处理及计算机技术的飞速发展，人机语音交互的重要性日益突出.其中，语音分离是将目标语音从混合语音中分离出来的一项重要任务.然而，在著名的“鸡尾酒会”等复杂开放环境下语音的分离远没有达到令人满意的效果.针对现实生活中多说话人交流场景，本文以空洞卷积（Dilated Convolutions Neural Network，DCNN）和双向长短时记忆（Bi-directional Long Short-Term Memory，BiLSTM）为网络基础，提出一种视听融合的语音分离（DCNN-BiLSTM）模型.该模型在训练过程中通过音频编号查找与之对应的视觉信息，视觉信息可以将音频聚焦在说话场景中该说话人上，以达到增强语音分离效果.在AVSpeech数据集上进行实验测试，利用PESQ（Perceptual Evaluation of Speech Quality）、STOI（Short-Time Objective Intelligibility）和SDR（Signal-to-Distortion Ratio）指标评价分离效果.研究表明，本文方法比经典的AVSpeech分离方法在语音分离能力上提高了3.37 dB.

Abstract

In recent years, with the rapid development of speech processing and computer technology, human-computer speech interaction has become increasingly important. Speech separation, which extracts the target speech from a speech mixture, is a key task in this area. However, in complex open environments such as the well-known "cocktail party" scenario, speech separation is still far from achieving satisfactory results. For the multi-speaker scenarios found in real life, this paper builds on a dilated convolutional neural network (DCNN) and a bi-directional long short-term memory (BiLSTM) network and presents an audio-visual fusion speech separation model, DCNN-BiLSTM. During training, the model uses the audio number to look up the corresponding visual information, and the visual information focuses the audio on the corresponding speaker in the scene to enhance the separation. Experiments are carried out on the AVSpeech dataset, and the separation performance is evaluated with the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI) and Signal-to-Distortion Ratio (SDR) metrics. The results show that the proposed method improves the speech separation ability by 3.73 dB compared with the traditional speech separation method.
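
The improvement reported in the abstract is given in dB, which among the three metrics matches only the SDR. As a rough illustration of what such a figure measures, the sketch below computes a basic energy-ratio SDR, 10·log10(‖s‖² / ‖s − ŝ‖²), on synthetic signals. This is a simplified definition chosen for illustration; the abstract does not say which SDR variant or toolkit the authors used, and the function name `sdr_db` and the toy sine-wave signals are assumptions, not part of the paper.

```python
import math


def sdr_db(reference, estimate):
    """Basic signal-to-distortion ratio in dB (simplified, illustrative definition)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Energy of the clean target signal.
    signal_energy = sum(r * r for r in reference)
    # Energy of everything in the estimate that is not the target.
    distortion_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal is silent")
    if distortion_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_energy / distortion_energy)


# Toy data (hypothetical): a target tone plus an equally loud interfering tone.
n_samples = 100
reference = [math.sin(2 * math.pi * 5 * n / n_samples) for n in range(n_samples)]
interferer = [math.sin(2 * math.pi * 12 * n / n_samples) for n in range(n_samples)]

# Unprocessed mixture, and a stand-in "separated" output in which the
# interferer has been attenuated to one tenth of its amplitude.
mixture = [r + i for r, i in zip(reference, interferer)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

sdr_mixture = sdr_db(reference, mixture)
sdr_estimate = sdr_db(reference, estimate)
improvement = sdr_estimate - sdr_mixture
```

In this toy case, attenuating the interferer to one tenth of its amplitude reduces the distortion energy by a factor of 100, so `improvement` comes out at about 20 dB. A difference such as the 3.73 dB quoted above would be read the same way: as the gap between the SDR values of two systems evaluated against the same reference.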
