Spoken Term Detection Based on Feature Space Trajectory Information

TIAN Ying-hui; HE Qian-hua; ZHENG Ruo-wei; WEI Zhuo; LI Yan-xiong

doi:10.12263/DZXB.20220289

Academic paper | Updated: 2025-12-08

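- Spoken Term Detection Based on Feature Space Trajectory Information
- Acta Electronica Sinica, 2023, Vol. 51, No. 10, pp. 2915-2924
- Affiliation:
  
  South China University of Technology, Guangzhou, Guangdong 510641
- About the authors:
  
  TIAN Ying-hui, female, born in 1997 in Zhumadian, Henan. Master's student at South China University of Technology. Her main research interests are speech signal processing and spoken keyword detection. E-mail: 13174416712@163.com
  
  HE Qian-hua (corresponding author), male, born in 1965 in Shaodong, Hunan. Ph.D. Professor and doctoral supervisor at South China University of Technology. His main research interests are intelligent audio signal processing, speech recognition, and speaker recognition.
  
  ZHENG Ruo-wei, male, born in 1998 in Shantou, Guangdong. Master's student at South China University of Technology. His main research interests are spoken keyword detection and speech recognition. E-mail: ruoweizheng@foxmail.com
  
  WEI Zhuo, female, born in 1997 in Yueyang, Hunan. Master's student at South China University of Technology. Her main research interests are speech signal processing and speaker recognition. E-mail: 201921011738@mail.scut.edu.cn
  
  LI Yan-xiong, male, born in 1980 in Jiahe, Hunan. Ph.D. Associate professor and doctoral supervisor at South China University of Technology. His main research interests are speech and audio signal processing and machine learning. E-mail: eeyxli@scut.edu.cn
- Funding:
  
  Natural Science Foundation of Guangdong Province (2022A1515011687); National Natural Science Foundation of China (61571192)
- DOI: 10.12263/DZXB.20220289
  CLC number: TP391.4; TP391.9
- Received: 2022-03-22; Revised: 2022-08-26; Published in print: 2023-10-25

TIAN Ying-hui, HE Qian-hua, ZHENG Ruo-wei, et al. Spoken Term Detection Based on Feature Space Trajectory Information[J]. ACTA ELECTRONICA SINICA, 2023, 51(10): 2915-2924. DOI: 10.12263/DZXB.20220289.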

Abstract

Deep learning currently dominates spoken term detection, but it requires large amounts of annotated data for training and is therefore difficult to apply in the more common low-resource scenarios. This paper proposes a low-resource spoken keyword detection method based on trajectory information in the audio feature space. The method starts from two facts: a word is a structured composition of smaller linguistic units (syllables, phonemes), and the acoustic features of these units are stable in a statistical sense. Combining these facts with the principle of positioning in physical geometric space, the method constructs a feature space representation, a temporal information representation, and local discriminative information for each keyword. During detection, decisions are made hierarchically from the feature space trajectory information of the speech segment, so that pattern information and statistical information are used together. The audio feature space is represented by a set of identifiers trained on a large amount of unlabeled speech, whereas the feature space trajectory information of each keyword is built from a small number of keyword speech samples. Several experiments show that in low-resource conditions (fewer than 100 training samples) the proposed method clearly outperforms HMM-based and CRNN-based systems. With 10 training samples, for example, its FRR and FAR are lower than those of the HMM-based system by 20.5% and 8.7% (absolute), respectively. With sufficient training data (300 samples or more), it performs roughly on par with the CRNN-based system.
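
The abstract reports absolute decreases in FRR and FAR. As a minimal sketch of what such a comparison involves, the Python snippet below assumes the common definitions (FRR as false rejections over keyword trials, FAR as false alarms over non-keyword trials); the paper may define or normalize these rates differently. The function names and all counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def error_rates(false_rejects, keyword_trials, false_alarms, non_keyword_trials):
    """Return (FRR, FAR) as fractions in [0, 1].

    Assumed definitions (not taken from the paper):
      FRR = missed keyword occurrences / all keyword trials
      FAR = false detections / all non-keyword trials
    """
    if keyword_trials <= 0 or non_keyword_trials <= 0:
        raise ValueError("trial counts must be positive")
    frr = false_rejects / keyword_trials
    far = false_alarms / non_keyword_trials
    return frr, far


def absolute_drop(baseline_rate, proposed_rate):
    """Absolute reduction of an error rate, in percentage points."""
    return (baseline_rate - proposed_rate) * 100.0


# Hypothetical counts for a baseline system and a proposed system.
baseline_frr, baseline_far = error_rates(100, 200, 50, 200)
proposed_frr, proposed_far = error_rates(50, 200, 25, 200)

print(f"FRR absolute drop: {absolute_drop(baseline_frr, proposed_frr):.1f} points")
print(f"FAR absolute drop: {absolute_drop(baseline_far, proposed_far):.1f} points")
```

An absolute drop is the plain difference between two rates, so a fall from 50% to 25% is 25 percentage points, whereas the relative reduction in that case would be 50%.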
