Efficient and Privacy-Preserving Spoken Language Understanding for Resource-Constrained Microcontroller Unit

CAI Dong-qi; WANG Shang-guang; ZHANG Ze-ling; MA Xiao; XU Meng-wei

doi:10.12263/DZXB.20241154

Research article | Updated: 2025-12-27

- Efficient and Privacy-Preserving Spoken Language Understanding for Resource-Constrained Microcontroller Unit
- Acta Electronica Sinica, 2025, Vol. 53, No. 8, pp. 2601-2613
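- Author affiliations:
  
  1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. State Key Laboratory of Networking and Switching Technology, Beijing 100876, China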
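- About the authors:
  
  CAI Dong-qi, male, born in August 1999 in Yancheng, Jiangsu. He is a fourth-year direct-entry Ph.D. candidate at the School of Computer Science, Beijing University of Posts and Telecommunications, and is currently a visiting researcher at the University of Cambridge under a joint training program. His main research interest is efficient on-device machine learning systems. Chinese Institute of Electronics member No. E190182924A. E-mail: dc912@cam.ac.uk
  WANG Shang-guang, male, born in February 1982 in Zhoukou, Henan. He received his Ph.D. degree from Beijing University of Posts and Telecommunications in 2011 and is now a professor at its School of Computer Science. His main research interests are service computing, mobile edge computing, and satellite computing. He has published more than 150 papers. Chinese Institute of Electronics member No. E190027924S. E-mail: sgwang@bupt.edu.cn
  ZHANG Ze-ling, male, born in August 2000 in Chengdu, Sichuan. He is a master's student at the School of Computer Science, Beijing University of Posts and Telecommunications. E-mail: marovlo@bupt.edu.cn
  MA Xiao, female, born in September 1990 in Dezhou, Shandong. She received her Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, in 2018 and is now an associate professor at the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications. Her main research interests are mobile cloud computing and mobile edge computing. E-mail: maxiao18@bupt.edu.cn
  XU Meng-wei, male, born in June 1992 in Shaoxing, Zhejiang. He is an associate professor at the School of Computer Science, Beijing University of Posts and Telecommunications. His main research interests include mobile computing, edge computing, artificial intelligence, and systems software. Chinese Institute of Electronics member No. E190024575M. E-mail: mwx@bupt.edu.cn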
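- Funding:
  
  National Natural Science Foundation of China (62032003; U21B2016; 62425203); Young Elite Scientists Sponsorship Program of the China Association for Science and Technology (2023QNRC001)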
- Chinese Library Classification: TP31; TP36
- Received: 2024-12-23; Accepted: 2025-04-15; Published in print: 2025-08-25
CAI Dong-qi, WANG Shang-guang, ZHANG Ze-ling, et al. Efficient and Privacy-Preserving Spoken Language Understanding for Resource-Constrained Microcontroller Unit[J]. Acta Electronica Sinica, 2025, 53(08): 2601-2613. DOI: 10.12263/DZXB.20241154

Abstract

Speech is a widely used input interface for embedded mobile devices. Cloud service providers offer powerful spoken language understanding (SLU) services, but offloading speech to them poses a serious threat to user privacy, because sensitive information is processed remotely. To address this concern, disentanglement-based privacy-preserving encoders have been proposed to strip sensitive information from speech signals without impairing SLU functionality. However, such encoders are typically memory-intensive and computationally demanding, which limits their practicality on small, resource-constrained devices. Based on extensive experiments, this paper observes a key phenomenon: SLU relies on global information from the entire utterance, whereas the recognition of privacy-sensitive words depends predominantly on local information. Building on this observation, we propose SILENCE (SImpLe ENCodEr designed for efficient privacy-preserving SLU offloading), an efficient encoder for spoken intent understanding. We implemented SILENCE on an STM32H7 microcontroller unit and evaluated it under different attack scenarios. The results show that SILENCE is comparable to conventional privacy-preserving encoders in both speech intent classification accuracy and privacy protection, while achieving a speedup of up to 53.3 times and a 134.1-fold reduction in memory footprint, marking the first time that privacy-preserving SLU services have been realized on a microcontroller unit with only 1 MB of memory.
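The global-versus-local observation can be made concrete with a toy sketch. The abstract does not describe how SILENCE works internally, so the code below is not the authors' method: it only assumes a hypothetical utterance represented as a list of feature frames, with span masking and mean pooling chosen purely for illustration. Hiding a short span removes that span's local content entirely, while an utterance-level summary shifts only in proportion to the span's length.

```python
from typing import List

Frame = List[float]


def mask_span(frames: List[Frame], start: int, end: int) -> List[Frame]:
    """Return a copy of `frames` with the frames in [start, end) zeroed out."""
    return [
        [0.0] * len(f) if start <= i < end else list(f)
        for i, f in enumerate(frames)
    ]


def global_summary(frames: List[Frame]) -> Frame:
    """Mean-pool over time: a crude stand-in for utterance-level information."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


# Hypothetical utterance: 10 frames with 2 features each (all ones for easy arithmetic).
utterance = [[1.0, 1.0] for _ in range(10)]

# Hide a 2-frame span standing in for a privacy-sensitive word.
masked = mask_span(utterance, 3, 5)

print(masked[3])               # local content inside the span is gone
print(global_summary(masked))  # the global summary shifts by the masked fraction only
```

In this toy setting the masked frames carry no information at all, whereas the pooled summary still reflects the eight untouched frames. Real speech features and a real SLU model would behave less cleanly, so the sketch should be read as intuition rather than as evidence.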
