Speech Large Language Models: Architecture, Training and Challenges Analysis

ZHANG Ya-zhou; LIU Qi-meng; RONG Lu; ZHAO Bin; LI Ai-jun

doi:10.12263/DZXB.20250367


SURVEYS AND REVIEWS | Updated: 2025-12-27

- ACTA ELECTRONICA SINICA Vol. 53, Issue 9, Pages: 3454-3472 (2025)
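- Author affiliations:
  1. School of Software, Zhengzhou University of Light Industry, Zhengzhou, Henan 450000
  2. College of Intelligence and Computing, Tianjin University, Tianjin 300350
  3. School of Education, Tianjin University, Tianjin 300350
  4. Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 102488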
- DOI: 10.12263/DZXB.20250367
  CLC: TP391
- Received: 08 May 2025; Accepted: 25 August 2025; Published: 25 September 2025
ZHANG Ya-zhou, LIU Qi-meng, RONG Lu, et al. Speech Large Language Models: Architecture, Training and Challenges Analysis[J]. Acta Electronica Sinica, 2025, 53(09): 3454-3472. DOI: 10.12263/DZXB.20250367


Abstract

Large language models (LLMs) have achieved outstanding success across a wide range of downstream natural language processing (NLP) tasks, thanks to their remarkable ability to follow instructions and learn from context. As human intelligence is inherently multimodal, this research momentum has naturally expanded into other modalities, particularly vision and speech. In the realm of vision, large models such as GPT-4V and LLaVA employ foundational language models as the “brain,” enabling them to perform visual understanding and reasoning tasks and to cross traditional task barriers. In a similar vein, speech large language models (SLLMs) have attracted significant interest from both academia and industry. Models such as Whisper and Qwen-Audio have emerged, pushing performance boundaries in speech recognition, understanding, and synthesis and demonstrating significant potential for further development. This paper provides a systematic review of the latest advances in SLLM research. It describes the basic framework of these models and explores key concepts including model components, training strategies, data construction, and evaluation methods. It then analyzes the primary challenges in current research and discusses possible future directions.


