End-to-End Speech Keyword Spotting Training Method Based on Sample's Class Uncertainty

HE Qian-hua; CHEN Yong-qiang; ZHENG Ruo-wei; HUANG Jin-xin

doi:10.12263/DZXB.20240048


PAPERS | Updated: 2026-05-07

- ACTA ELECTRONICA SINICA Vol. 52, Issue 10, Pages: 3482-3492(2024)
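- Author affiliation: School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510641, China
- Funding: Guangdong Science Foundation (2023A0505050116; 2022A1515011687); National Natural Science Foundation of China (62371195)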
- DOI: 10.12263/DZXB.20240048
  CLC: TN912; TP391
- Received: 10 January 2024; Revised: 8 March 2024; Published: 25 October 2024
HE Qian-hua, CHEN Yong-qiang, ZHENG Ruo-wei, et al. End-to-End Speech Keyword Spotting Training Method Based on Sample's Class Uncertainty[J]. Acta Electronica Sinica, 2024, 52(10): 3482-3492. DOI: 10.12263/DZXB.20240048

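Abstract (translated from the Chinese)

End-to-end deep learning is currently the mainstream approach to speech keyword spotting. Research has concentrated on network architecture optimization, the choice of modeling units, and search strategies, and has progressed quickly, whereas training efficiency has received comparatively little attention. To address the training efficiency of deep learning models, this paper proposes a sample usage strategy based on class uncertainty sampling (CUS) that accelerates convergence. The core idea is that, in the middle and late stages of training, the information the network's forward output layer provides about each sample is used to measure that sample's class uncertainty; this measure is converted into a selection probability, and a subset of training samples is drawn at random for subsequent training. Because easy samples have high class certainty, they are less likely to take part in subsequent training. This does not impair the model's discriminative ability; it increases the focus on samples near the decision boundary and thereby improves training efficiency. Experiments on the AISHELL-1 Mandarin dataset show that, relative to the conventional training strategy, the average training time is shortened by 60% and the convergence time by 47.5%. At a false alarm rate (FAR) of 0.5 FP/h, the false reject rate (FRR) drops from 4.75% to 3.65%, a relative reduction of 30.1%, and the maximum term weighted value (MTWV) rises from 0.8374 to 0.8531. An analysis of how mislabeled samples participate in training confirms that the method can screen out most mislabeled samples, reducing the harm they do to training. Experiments on the large-scale AISHELL-2 Mandarin dataset further confirm the effectiveness of the proposed method.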

Abstract

End-to-end deep learning is the dominant technology for speech keyword spotting. Research has focused on exploring better network structures, modeling units, and search strategies, and has made considerable progress. However, less attention has been paid to training efficiency. In this paper, a novel class uncertainty sampling (CUS) strategy is proposed to select effective samples for each training epoch. Since only a subset is used, much training time is saved. The core idea of CUS is to measure the class uncertainty of samples using the forward information of the output layer during the middle and late training stages, and to select samples with a probability derived from their class uncertainty. More attention is therefore paid to samples near the decision boundary, which are prone to missed detections or false alarms. Furthermore, the proposed method can shield training from the interference of mislabeled samples. Experimental results on the AISHELL-1 Mandarin dataset show faster convergence and better performance. Compared with the conventional training strategy, the average training time and the average convergence time were shortened by 60% and 47.5%, respectively. At a false alarm rate (FAR) of 0.5 FP/h, the false reject rate (FRR) was reduced from 4.75% to 3.65%, a relative reduction of 30.1%, and the maximum term weighted value (MTWV) increased from 0.8374 to 0.8531. Moreover, it was experimentally verified that the method shields most mislabeled samples. These conclusions were confirmed by experiments on the large-scale AISHELL-2 Mandarin dataset.
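The sampling idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact uncertainty measure, so this example assumes normalized entropy of the softmax posterior as a stand-in, and it treats each sample as having a single output vector rather than a frame sequence. The function names and the `floor` parameter are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Convert raw output-layer scores into class posteriors."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_uncertainty(posteriors):
    """Normalized entropy in [0, 1] (assumed measure; needs at least 2 classes)."""
    entropy = -sum(p * math.log(p) for p in posteriors if p > 0.0)
    return entropy / math.log(len(posteriors))


def select_subset(all_logits, floor=0.05, rng=None):
    """Keep sample i with probability max(floor, uncertainty_i).

    `floor` is a hypothetical safeguard so that easy samples are
    down-weighted rather than excluded outright.
    """
    rng = rng or random.Random()
    keep_probs = [max(floor, class_uncertainty(softmax(z))) for z in all_logits]
    chosen = [i for i, p in enumerate(keep_probs) if rng.random() < p]
    return chosen, keep_probs


if __name__ == "__main__":
    # Toy logits for four samples over three classes (illustrative values only).
    logits = [
        [9.0, 0.0, 0.0],  # confidently classified: low uncertainty
        [1.0, 0.9, 0.8],  # near the decision boundary: high uncertainty
        [0.0, 8.0, 0.5],  # confidently classified
        [2.0, 2.0, 0.0],  # ambiguous between two classes
    ]
    chosen, probs = select_subset(logits, rng=random.Random(0))
    for i, p in enumerate(probs):
        print(f"sample {i}: keep probability {p:.3f}")
    print("selected for next epoch:", chosen)
```

In a training loop, a step like this would presumably run once per epoch after an initial warm-up on the full dataset, since the abstract applies the sampling only in the middle and late training stages.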


