LIAN Xiao-yu, XIA Nan, DAI Gao-le, et al. An In-Vehicle Interaction Speech Enhancement and Recognition Method Based on Lightweight Models in Complex Environment[J]. Acta Electronica Sinica, 2024, 52(04): 1282-1287. DOI:10.12263/DZXB.20230905
An In-Vehicle Interaction Speech Enhancement and Recognition Method Based on Lightweight Models in Complex Environment
To address the low recognition rate of in-vehicle voice interaction in complex noise environments and the difficulty of deploying models on devices with limited computing resources, this article proposes a lightweight, noise-robust speech recognition method based on a joint training framework. The speech enhancement model introduces a multi-scale channel time-frequency attention module to extract multi-scale time-frequency features and key information in various dimensions. In the speech recognition model, multi-head element-wise linear attention is proposed, which significantly reduces the computational complexity of the attention module. Experiments show that the jointly trained model exhibits good noise robustness on a self-built dataset.
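To illustrate why a linear attention variant can reduce the cost of the attention module, here is a minimal sketch of generic kernelized linear attention for a single head. It is not the multi-head element-wise linear attention proposed in the article, whose details the abstract does not give. The feature map `phi` (elu(x) + 1) and the function names `linear_attention` and `quadratic_reference` are assumptions made for illustration only.

```python
import math

def phi(x):
    # Assumed feature map elu(x) + 1, which is always positive.
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Kernelized attention for one head; Q and K are N x d, V is N x d_v (lists)."""
    d, d_v = len(K[0]), len(V[0])
    # Summaries over all keys and values, built once:
    # S is d x d_v (sum of phi(k) v^T), z has length d (sum of phi(k)).
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for k_row, v_row in zip(K, V):
        fk = [phi(x) for x in k_row]
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v_row[b]
    # Each query reuses S and z, so the total cost grows linearly with N.
    out = []
    for q_row in Q:
        fq = [phi(x) for x in q_row]
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out

def quadratic_reference(Q, K, V):
    """Same weights computed for every query-key pair, O(N^2 d); a check only."""
    out = []
    for q_row in Q:
        fq = [phi(x) for x in q_row]
        w = [sum(a * phi(b) for a, b in zip(fq, k_row)) for k_row in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(len(V))) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same result; the difference is that `linear_attention` never forms the N x N weight matrix, so its cost scales as O(N d d_v) rather than O(N^2 d) in the sequence length N.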