An End-to-End Chinese Speech Recognition Algorithm Integrating Language Model

LÜ Kun-ru; WU Chun-guo; LIANG Yan-chun; YUAN Yu-ping; REN Zhi-min; ZHOU You; SHI Xiao-hu

doi:10.12263/DZXB.20201187

PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 49, Issue 11, Pages: 2177-2185(2021)
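- Author affiliations:
  
  1. College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012
  2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012
  3. School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, Guangdong 519041
- DOI: 10.12263/DZXB.20201187
  CLC: TP18; TP39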
- Received: 23 October 2020; Revised: 20 July 2021; Published: 25 November 2021
- Citation: LÜ Kun-ru, WU Chun-guo, LIANG Yan-chun, et al. An End-to-End Chinese Speech Recognition Algorithm Integrating Language Model[J]. ACTA ELECTRONICA SINICA, 2021, 49(11): 2177-2185. DOI: 10.12263/DZXB.20201187.

Abstract

Speech recognition models are not robust on Chinese speech, and without language modeling ability they cannot effectively distinguish homophones or near-homophones. To address these problems, this paper proposes an end-to-end Chinese speech recognition algorithm that integrates a language model. First, a speech-to-Pinyin acoustic model is built on a Deep Fully Convolutional Neural Network (DFCNN) and Connectionist Temporal Classification (CTC). Then a Pinyin-to-Chinese-character language model is constructed from the encoder of the Transformer. Finally, a speech frame decomposition model is designed to link the output of the acoustic model to the input of the language model. This overcomes the difficulty that the loss gradient of the language model cannot be passed back to the acoustic model, and it enables joint training of the two models. The proposed method is evaluated on real data sets. Experimental results show that introducing the language model reduces the word error rate (WER) of the algorithm by 21%, and that end-to-end joint training plays a key role, improving performance by 43%. Compared with five mainstream algorithms, the proposed method has a clearly lower error rate, and its WER is 28% lower than that of DeepSpeech2, the best of the comparison models.
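As a rough illustration of how error-rate figures like those in the abstract are commonly computed, the sketch below measures edit distance per Chinese character and expresses an improvement as a relative reduction. It is not the authors' evaluation code. The function names, the sample sentence, and the numeric rates are hypothetical, and it assumes the reported percentages are relative reductions, which the abstract does not state.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions
    needed to turn the sequence `hyp` into the sequence `ref`."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items when hyp prefix is empty
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance divided by reference length (per character for strings)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical homophone confusion: 气 and 汽 share the Pinyin "qi4".
    reference = "今天天气很好"
    hypothesis = "今天天汽很好"
    print(error_rate(reference, hypothesis))  # one wrong character out of six

    # Hypothetical error rates, chosen only to show the arithmetic.
    print(relative_reduction(0.10, 0.072))
```

The homophone pair in the demo is the kind of error an acoustic model alone cannot resolve, since both characters map to the same Pinyin; only the surrounding context, which a language model captures, separates them.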
