

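Institute of Speech and Audio Information Processing, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China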
Received: 14 November 2023
Revised: 5 March 2024
Published: 25 August 2024
HUANG Jin-wei, BAO Chang-chun, ZHOU Jing. A Speech Packet Loss Concealment Method Based on Priori Mel-Spectrum and Neural Vocoder[J]. Acta Electronica Sinica, 2024, 52(08): 2581-2590. DOI:10.12263/DZXB.20231056
For neural-network-based speech packet loss concealment (PLC), the input features are a crucial factor that directly affects the final recovery performance. In addition, restoring highly natural speech through PLC remains an open challenge. To recover lost speech effectively and improve its naturalness, this paper proposes a speech PLC method based on a prior Mel-spectrum and a neural vocoder. The method adopts an asymmetric encoder-decoder network structure. At the encoding stage, two independent encoder networks extract deep time-domain and frequency-domain features from the time-domain waveform and the Mel-spectrum, respectively. At the decoding stage, these deep features are jointly fed into a neural vocoder composed of several temporal adaptive denormalization layers, which restores the lost speech signal and improves its naturalness. Simulation experiments show that the proposed method outperforms two existing PLC algorithms in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI).
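The decoder described above is built from temporal adaptive denormalization layers. The sketch below illustrates the general normalize-then-modulate idea behind such a layer; it is not the authors' implementation. It assumes that the conditioning features already have the same time resolution as the activations (upsampling is omitted), that the scale and shift are produced by kernel-size-1 convolutions, and that the weights are fixed numbers rather than learned parameters. All function and parameter names (`instance_norm`, `pointwise_conv`, `tade_layer`, `w_gamma`, `w_beta`) are hypothetical.

```python
import math


def instance_norm(x, eps=1e-5):
    """Normalize each channel (row) of x to zero mean and unit variance over time."""
    out = []
    for channel in x:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        out.append([(v - mean) / math.sqrt(var + eps) for v in channel])
    return out


def pointwise_conv(cond, weight, bias):
    """Kernel-size-1 convolution: mix the conditioning channels at each time step.

    cond:   list of C_in channels, each a list of T samples
    weight: list of C_out rows, each holding C_in coefficients
    bias:   list of C_out offsets
    """
    steps = len(cond[0])
    return [
        [sum(w * cond[c][t] for c, w in enumerate(row)) + b for t in range(steps)]
        for row, b in zip(weight, bias)
    ]


def tade_layer(x, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Normalize x, then scale and shift it with time-varying gamma and beta.

    gamma and beta are computed from the conditioning features, so the
    modulation differs at every time step (assumed simplification: no
    upsampling and no non-linearity in the conditioning branch).
    """
    normed = instance_norm(x)
    gamma = pointwise_conv(cond, w_gamma, b_gamma)
    beta = pointwise_conv(cond, w_beta, b_beta)
    return [
        [g * n + b for g, n, b in zip(g_row, n_row, b_row)]
        for g_row, n_row, b_row in zip(gamma, normed, beta)
    ]


# Toy usage: two activation channels, one conditioning channel, four time steps.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5]]
cond = [[0.0, 1.0, 0.0, 1.0]]
y = tade_layer(
    x, cond,
    w_gamma=[[0.0], [0.0]], b_gamma=[1.0, 1.0],  # gamma is 1 everywhere
    w_beta=[[1.0], [1.0]], b_beta=[0.0, 0.0],    # beta follows the conditioning
)
```

With `gamma` fixed to one and `beta` fixed to zero, the layer reduces to plain instance normalization; letting them depend on the conditioning features is what allows the encoder outputs to steer the generated waveform at each time step.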