A Method of Multi-Scale Forward Attention Model for Speech Recognition

TANG Hai-tao, XUE Jia-bin, HAN Ji-qing

Acta Electronica Sinica, 2020, Vol. 48, Issue 7: 1255-1260. DOI: 10.3969/j.issn.0372-2112.2020.07.002

Research Article

Abstract

The attention-based model is currently the dominant model in speech recognition, but it suffers from a drawback: the attention mechanism may produce abnormal scores at the current time step. To address this problem, this paper first proposes a forward attention model, which uses the normal attention scores from the previous time step to smooth the abnormal scores at the current time step. The model is then optimized by adding a constraint factor to the previous time step's attention scores, making the smoothing of abnormal scores adaptive. Finally, a multi-scale forward attention model is built on top of the optimized model: it introduces a multi-scale scheme to model speech primitives at different levels and then fuses the resulting target vectors of the different levels, thereby resolving the outliers in the attention scores. In the experiments, SwitchBoard is used as the training set and Hub5'00 as the test set; compared with the baseline system, the proposed multi-scale forward attention model reduces the word error rate (WER) by a relative 14.28%.
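
To make the mechanism concrete, the sketch below illustrates the two ideas from the abstract in NumPy: smoothing the current attention scores with the previous step's scores under a constraint factor, and fusing target vectors computed over two granularities of speech primitives. This is a minimal illustration under stated assumptions, not the authors' implementation: the fixed constraint factor `gamma`, the weighted-sum fusion rule, and all names (`forward_attention`, `multiscale_fuse`, `H_char`, `H_word`) are assumptions for the sketch.

```python
import numpy as np

def forward_attention(prev_scores, cur_scores, gamma=0.5):
    """Smooth the current step's (possibly abnormal) attention scores with
    the previous step's (normal) scores, then renormalize.

    `gamma` plays the role of the constraint factor on the previous scores;
    the paper's adaptive variant would predict it rather than fix it
    (a fixed value is an assumption made here for illustration).
    """
    smoothed = gamma * prev_scores + (1.0 - gamma) * cur_scores
    return smoothed / smoothed.sum()

def multiscale_fuse(vec_char, vec_word, w=0.5):
    """Fuse target vectors computed over speech primitives of two
    granularities (e.g. character-level and word-level). A weighted sum
    is an assumed stand-in for the paper's fusion step."""
    return w * vec_char + (1.0 - w) * vec_word

# Toy demo: the abnormal spike on frame 3 at step t is damped by the
# normal score distribution from step t-1.
prev = np.array([0.05, 0.70, 0.20, 0.05])   # normal scores at step t-1
cur  = np.array([0.02, 0.08, 0.10, 0.80])   # abnormal spike at step t
alpha = forward_attention(prev, cur, gamma=0.5)

# Random stand-ins for encoder outputs at two scales (4 frames, 8 dims).
H_char = np.random.randn(4, 8)
H_word = np.random.randn(4, 8)
target = multiscale_fuse(alpha @ H_char, alpha @ H_word)
print("smoothed attention:", alpha)
print("fused target vector shape:", target.shape)
```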

Key words

forward attention mechanism / adaptive smoothing / multi-scale / speech recognition

Cite this article

TANG Hai-tao, XUE Jia-bin, HAN Ji-qing. A Method of Multi-Scale Forward Attention Model for Speech Recognition[J]. Acta Electronica Sinica, 2020, 48(7): 1255-1260. https://doi.org/10.3969/j.issn.0372-2112.2020.07.002
CLC number: TN912.34

Funding

National Key R&D Program of China (No. 2017YFB1002102)
