School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510640, China
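CHEN Xi-kun: male, born in 1998 in Ganzhou, Jiangxi. Graduate student at the School of Electronic and Information Engineering, South China University of Technology. Research interests: speech super-resolution and speech bandwidth extension. E-mail: 2363862709@qq.com
YANG Jun-mei (corresponding author): female, born in 1979 in Jinan, Shandong. Master's supervisor at the School of Electronic and Information Engineering, South China University of Technology. Research interests: intelligent signal processing, adaptive filtering, image super-resolution reconstruction, and speech dereverberation. E-mail: yjunmei@scut.edu.cn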
Received: 2022-04-12
Revised: 2022-06-25
Published in print: 2023-04-25
CHEN Xi-kun, YANG Jun-mei. Speech Super-Resolution Algorithm Based on Discrete Wavelet Packet Transform and Capsule Generative Adversarial Network[J]. ACTA ELECTRONICA SINICA, 2023, 51(04): 1039-1049. DOI: 10.12263/DZXB.20220395.
Mainstream speech super-resolution (SSR) algorithms currently use convolutional neural networks (CNN) to convert a low-resolution (LR) speech signal into a high-resolution (HR) speech signal. However, the HR signal reconstructed by a plain CNN is usually over-smoothed and lacks detail. Introducing generative adversarial networks (GAN) can effectively address this problem. In addition, capsule networks (CapsNet) can encode spatial information into features, so combining them with a GAN improves the discriminator's ability to tell real data from generated data. The discrete wavelet transform (DWT) is a tool for orthogonal multi-resolution analysis that performs very well in signal processing. The discrete wavelet packet transform (DWPT) is an extension of the DWT that provides more effective signal analysis in some applications. Building on DWPT and the capsule generative adversarial network (CapsGAN), this paper proposes an SSR network architecture named Wavelet-SRGAN. Comparative experiments show that the proposed Wavelet-SRGAN achieves performance comparable to current state-of-the-art methods while using the fewest parameters. The algorithm has three core steps: (1) adding a DWPT layer to the generator network; (2) adding a capsule network to the discriminator; (3) adding a wavelet loss during training.
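As a rough illustration of two ideas named in the abstract, the DWPT and a wavelet-domain loss, the sketch below decomposes a signal into a full wavelet packet tree and compares two signals by their packet coefficients. It is not the authors' implementation: the abstract does not state which wavelet or loss form is used, so the Haar filter, the L1 form of the loss, and all function names (`haar_split`, `dwpt`, `idwpt`, `wavelet_l1_loss`) are assumptions made for this example.

```python
import numpy as np


def haar_split(x):
    """One Haar analysis step: return (approximation, detail), each half the length of x."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)


def haar_merge(approx, detail):
    """Inverse of haar_split: interleave the reconstructed even and odd samples."""
    out = np.empty(approx.size * 2, dtype=float)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out


def dwpt(x, levels):
    """Full wavelet packet tree: every band is split at every level.

    Returns 2**levels subbands in tree order (not sorted by frequency).
    len(x) must be divisible by 2**levels.
    """
    bands = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        bands = [half for band in bands for half in haar_split(band)]
    return bands


def idwpt(bands):
    """Rebuild the signal by merging sibling bands level by level."""
    bands = list(bands)
    while len(bands) > 1:
        bands = [haar_merge(bands[i], bands[i + 1]) for i in range(0, len(bands), 2)]
    return bands[0]


def wavelet_l1_loss(pred, target, levels=2):
    """Mean absolute difference between DWPT coefficients (assumed loss form)."""
    p = np.concatenate(dwpt(pred, levels))
    t = np.concatenate(dwpt(target, levels))
    return float(np.mean(np.abs(p - t)))


rng = np.random.default_rng(0)
x = rng.standard_normal(16)
bands = dwpt(x, levels=2)
x_rec = idwpt(bands)
```

Unlike the plain DWT, which keeps splitting only the approximation band, the packet transform also splits the detail bands, so a 2-level tree yields four equal-length subbands. Because the Haar filter pair is orthonormal, `idwpt(dwpt(x, levels))` recovers `x` up to floating-point error.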