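A Multi-Scale Short-Utterance Speaker Recognition Model Based on a Time-Frequency Attention Conformer

YANG Lu; ZHANG Bang-cheng; YANG Jun-mei; ZENG De-lu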

doi:10.12263/DZXB.20241114

Research article | Updated: 2025-12-27

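- English title: TFA-Conformer Based Network for Short Utterance Speaker Recognition
- Journal: Acta Electronica Sinica, 2025, Vol. 53, No. 8, pp. 2658-2667
- Author affiliation:
  School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510630, China
- Author biographies:
  YANG Lu (female), born in 2000, is a master's student at the School of Electronic and Information Engineering, South China University of Technology. Her research focuses on voiceprint recognition within speech signal processing. E-mail: 202221013354@mail.scut.edu.cn
  ZHANG Bang-cheng (male), born in 2000, is a master's student at the School of Electronic and Information Engineering, South China University of Technology. His research focuses on speech separation within speech signal processing. E-mail: 202221013363@mail.scut.edu.cn
  YANG Jun-mei (female) has taught at the School of Electronic and Information Engineering, South China University of Technology, since March 2009. Her main research interests include intelligent signal processing, adaptive filtering, image super-resolution reconstruction, and speech dereverberation. E-mail: yjunmei@scut.edu.cn
  ZENG De-lu (male) teaches at the School of Electronic and Information Engineering, South China University of Technology. His research covers interdisciplinary theory and applications spanning mathematics and information science. E-mail: dlzeng@scut.edu.cn
- Funding:
  Natural Science Foundation of Guangdong Province (2023A1515011281)
- DOI: 10.12263/DZXB.20241114
  Chinese Library Classification (CLC) number: TP391
- Received: 2024-12-11; accepted: 2025-05-29; published in print: 2025-08-25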
YANG Lu, ZHANG Bang-cheng, YANG Jun-mei, et al. TFA-Conformer Based Network for Short Utterance Speaker Recognition[J]. Acta Electronica Sinica, 2025, 53(08): 2658-2667. DOI: 10.12263/DZXB.20241114

Abstract

Short-utterance recognition is one of the current challenges in speaker recognition (SR) because data are scarce and feature extraction is imprecise. For voiceprint feature extraction and identity recognition from short utterances in data-limited scenarios, this paper designs a short-utterance speaker recognition network based on time-frequency (T-F) attention and convolutional enhancement. We introduce T-F attention and convolution into the Transformer encoder and propose a module called the time-frequency attention Conformer (Time-Frequency Attention Convolution-augmented Transformer, TFA-Conformer). The module uses the information in the T-F channels to compute validity weights from global to local, which helps the model capture precise acoustic features and enables the feature encoder to produce highly discriminative speaker embeddings under short-utterance conditions (3 s or less). We evaluate the proposed supervised network on the standard speaker datasets TIMIT and ST-CMDS. Under short-utterance conditions, its recognition accuracy and other metrics improve by 4.837% on average over mainstream methods, and on speech segments with shorter duration and less data it achieves an average relative improvement of 2.799%. The proposed model also has fewer parameters and lower computational complexity, so it is both suited to short-utterance scenarios and more lightweight.
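The abstract does not give the internals of the TFA-Conformer module, so the following is only a minimal sketch of the general idea of time-frequency attention, not the authors' implementation. It assumes a single-channel spectrogram stored as a list of frames, pools it along each axis to obtain one descriptor per frame and one per frequency bin, squashes each descriptor with a sigmoid, and reweights every T-F cell by the product of its two weights. The function name `time_frequency_attention` and the scalar parameters `w_t`, `b_t`, `w_f`, `b_f` are hypothetical stand-ins for the learned layers a real model would use.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def time_frequency_attention(spec, w_t=1.0, b_t=0.0, w_f=1.0, b_f=0.0):
    """Reweight a spectrogram with separable time and frequency attention.

    spec: list of T frames, each a list of F values (illustrative layout).
    w_t, b_t, w_f, b_f: hypothetical scalar parameters standing in for the
    learned projections of a real attention module.
    """
    n_frames = len(spec)
    n_bins = len(spec[0])

    # Global descriptors: average over frequency for each frame,
    # and average over time for each frequency bin.
    time_desc = [sum(frame) / n_bins for frame in spec]
    freq_desc = [sum(frame[f] for frame in spec) / n_frames
                 for f in range(n_bins)]

    # Attention weights in (0, 1) along each axis.
    time_att = [sigmoid(w_t * d + b_t) for d in time_desc]
    freq_att = [sigmoid(w_f * d + b_f) for d in freq_desc]

    # The outer product of the two weight vectors gives a 2-D attention
    # map, applied to every time-frequency cell.
    return [[spec[t][f] * time_att[t] * freq_att[f] for f in range(n_bins)]
            for t in range(n_frames)]


if __name__ == "__main__":
    # Toy 3-frame, 4-bin input; the values are made up for illustration.
    toy = [[0.1, 0.9, 0.2, 0.0],
           [0.0, 1.2, 0.3, 0.1],
           [0.0, 0.1, 0.0, 0.0]]
    for row in time_frequency_attention(toy):
        print([round(v, 3) for v in row])
```

Because both weight vectors lie strictly between 0 and 1, the sketch can only attenuate cells; frames and bins with larger average energy are attenuated less than the others.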


