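1. School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240
2. School of Computer Science, Central South University, Changsha, Hunan 410083
3. Department of Computer Science and Technology, Tsinghua University, Beijing 100842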
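[ "LU Yu (鲁昱), male, born in June 2001 in Jianyang, Sichuan Province. He is currently a Ph.D. student in Computer Science and Technology at the School of Computer Science, Shanghai Jiao Tong University. His main research interests are mobile computing and intelligent sensing. E-mail: yulu01@sjtu.edu.cn" ]
[ "FU Yong-jian (付永健), male, born in February 1999 in Chengdu, Sichuan Province. He received his bachelor's degree in Internet of Things Engineering from Central South University in 2021. He is currently a visiting Ph.D. student in the Department of Computer Science and Technology, Tsinghua University, and a Ph.D. student in Computer Science and Technology at Central South University. His main research interest is edge intelligence. E-mail: yongjianwork@gmail.com" ]
[ "DING Dian (丁典), male, born in January 1994 in Hai'an, Jiangsu Province. He is currently a postdoctoral researcher at the School of Computer Science, Shanghai Jiao Tong University. His main research interests are mobile computing and intelligent sensing. E-mail: dingdian94@sjtu.edu.cn" ]
[ "PAN Hao (潘昊), male, born in November 1994 in Gaoyou, Jiangsu Province. He graduated from the Department of Computer Science, Shanghai Jiao Tong University, in 2022 and is currently a tenure-track associate professor at the School of Computer Science, Shanghai Jiao Tong University. His main research interests include wireless communication, wireless sensing, and wearable healthcare. E-mail: panh09@sjtu.edu.cn" ]
[ "XUE Guang-tao (薛广涛), male, born in May 1976 in Xuzhou, Jiangsu Province. He received his Ph.D. in Computer Software and Theory from Shanghai Jiao Tong University in 2004 and is currently a distinguished professor at the School of Computer Science, Shanghai Jiao Tong University. His main research interests include Internet of Things technology, intelligent sensing, distributed computing systems, and data circulation and governance. E-mail: gt_xue@sjtu.edu.cn" ]
[ "REN Ju (任炬), male, born in December 1987 in Miluo, Hunan Province. Ph.D., tenured associate professor in the Department of Computer Science and Technology, Tsinghua University. Recipient of a national-level talent program. His main research interests include edge intelligent computing and intelligent collaboration, and security and privacy protection for edge intelligence. E-mail: renju@tsinghua.edu.cn" ]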
Received: 2025-06-19,
Accepted: 2025-09-25,
Published in print: 2025-10-25
LU Yu, FU Yong-jian, DING Dian, et al. A Lightweight Neural Speech Compression Method for Edge Devices[J]. Acta Electronica Sinica, 2025, 53(10): 3483-3496. DOI:10.12263/DZXB.20250524
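In recent years, neural-network-driven audio compression methods have shown significant advantages in low-bitrate speech reconstruction, but their high computational overhead and deployment complexity limit their practical use on edge devices. To address this, this paper proposes a lightweight neural speech compression system for resource-constrained scenarios such as mobile terminals. Building on the FunCodec framework, the system optimizes the design of the encoder module, constructs a simplified structure based on convolutional neural networks, and introduces a knowledge distillation strategy that combines perceptual alignment, spectral constraints, and adversarial training to transfer the representational capability of the teacher model effectively. Experimental results show that the proposed convolutional neural network encoder greatly reduces model complexity and inference latency while keeping compression quality close to that of the original system, enabling millisecond-level audio compression on edge devices. Furthermore, to address the redundancy present in the raw quantization indices, this paper proposes a variable-length coding method based on Huffman trees, which saves about 5% of storage space without affecting reconstruction accuracy and improves the transmission efficiency of the system. Comprehensive experimental results show that the proposed scheme achieves a good balance among compression quality, computational overhead, and engineering deployability, and has the potential for wide adoption in practical speech acquisition and sensing systems.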
Neural audio compression methods have shown remarkable performance in low-bitrate speech reconstruction, but their high computational cost and deployment complexity limit their practical use on edge devices. To address this issue, this paper proposes a lightweight neural speech compression system tailored for resource-constrained scenarios such as mobile terminals. Based on the FunCodec framework, we redesign the encoder module using a streamlined convolutional neural network architecture and introduce a multi-objective knowledge distillation strategy that integrates perceptual alignment, spectral constraints and adversarial training. Experimental results demonstrate that the proposed convolutional neural network encoder significantly reduces model complexity and inference latency while maintaining comparable compression quality, enabling millisecond-level real-time speech encoding on edge devices. Furthermore, to improve transmission efficiency, we present a Huffman coding-based entropy optimization method that adaptively encodes residual quantization outputs, achieving an average storage reduction of approximately 5% without compromising reconstruction quality. Overall, the proposed system strikes a favorable balance between compression fidelity, computational efficiency and deployability, making it well-suited for real-world speech acquisition and processing applications on edge platforms.
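To make the Huffman step in the abstract concrete, the following is a minimal sketch of variable-length coding applied to quantization indices. It is not the authors' implementation: the function names (`build_huffman_table`, `encode`, `decode`), the codebook size of 1024, and the skewed synthetic index distribution are all assumptions chosen for illustration, and the cost of storing the code table is ignored. The sketch only shows why a non-uniform index distribution lets a lossless prefix code use fewer bits than fixed-length indices.

```python
import heapq
import random
from collections import Counter


def build_huffman_table(symbols):
    """Build a {symbol: bitstring} prefix code from observed symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(counts)): "0"}
    # Heap items are (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, table):
    """Concatenate the code of each symbol into one bitstring."""
    return "".join(table[s] for s in symbols)


def decode(bits, table):
    """Invert encode(); valid because no code is a prefix of another."""
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


if __name__ == "__main__":
    # Hypothetical setup: 1024-entry codebook, synthetic skewed index usage.
    CODEBOOK_SIZE = 1024
    rng = random.Random(0)
    weights = [1.0 / (k + 1) ** 0.5 for k in range(CODEBOOK_SIZE)]
    indices = rng.choices(range(CODEBOOK_SIZE), weights=weights, k=20000)

    table = build_huffman_table(indices)
    bits = encode(indices, table)
    fixed_bits = len(indices) * (CODEBOOK_SIZE - 1).bit_length()

    assert decode(bits, table) == indices  # lossless round trip
    saving = 1.0 - len(bits) / fixed_bits
    print(f"fixed: {fixed_bits} bits, huffman: {len(bits)} bits, saving: {saving:.1%}")
```

The saving printed by this sketch depends entirely on how skewed the synthetic indices are, so it should not be compared with the roughly 5% figure reported in the abstract, which was measured on real codec output.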