

School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
Received: 19 October 2025
Accepted: 23 January 2026
Published: 25 January 2026
GAO Shang, JIA Maoshen. Multi-Sound Source Separation Method Based on Multi-Resolution Parallel Feature Extraction[J]. Acta Electronica Sinica, 2026, 54(01): 183-194. DOI:10.12263/DZXB.20250937
With the rapid development of the Internet of Everything, intelligent sensing, and human-machine interaction, multi-source separation in complex acoustic environments has become an important front-end problem in speech signal processing. Non-stationary speech, however, distributes its energy differently across time and frequency scales: it contains both rapidly changing formant structures and relatively stable harmonic and periodic components. Single-resolution time-frequency analysis faces a fundamental constraint here. A short analysis window gives too little frequency resolution to tell apart the harmonic structures of multiple sources, while a long window lowers time resolution and misses the fast transients of speech. As a result, existing fixed-resolution separation methods often show blurred time-frequency structure, loss of speech detail, unbalanced separation, and distorted output signals in real, complex acoustic environments, which limits their robustness and practical usability. To address these problems, the proposed method uses a multi-branch parallel deep neural network. Each branch independently processes the time-frequency spectrum produced with a different window length and models features with nested, hierarchical recurrent units. Specifically, each branch contains two levels of recurrence: a frequency-spatial modeling unit (Frequency Long Short-Term Memory, F-LSTM) recurs along the frequency axis to extract cross-channel spatial correlations and spectral structure, and a time-spatial modeling unit (Time Long Short-Term Memory, T-LSTM) recurs along the time axis to capture the long-term dynamics and temporal dependencies of speech. The spectra of different resolutions, generated with different analysis windows, are fed into the network in parallel so that the branches complement one another in time and frequency resolution. During training, all branches are jointly optimized through a shared time-domain reconstruction loss, which encourages consistent and complementary representations across resolutions. The nested structure in each branch strengthens the interaction and fusion of cross-resolution features. At the output, the complex spectral masks estimated by the branches are combined by a fusion layer, the time-domain signal is reconstructed with the inverse short-time Fourier transform, and the network is trained end to end under both time-domain and frequency-domain constraints. Through this multi-resolution joint optimization, the model captures transient details and periodic harmonic structure in the speech spectrogram at the same time. The proposed multi-resolution fusion scheme significantly improves objective metrics and subjective listening quality in highly reverberant, multi-speaker environments. Its structure is also flexible and can be transferred to other network frameworks based on time-frequency analysis, offering an extensible design approach and methodological basis for future multi-source separation models aimed at complex sound fields.
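The window-length trade-off that motivates the parallel branches can be made concrete with a small calculation. The sketch below is illustrative only and is not the authors' implementation: the sampling rate, the three window lengths, and the 20 Hz pitch gap between two talkers are all assumed values, and the names `Resolution`, `stft_resolution`, and `resolves` are hypothetical. It uses the nominal DFT bin spacing as the frequency resolution, which is an optimistic figure because a tapered analysis window widens the main lobe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolution:
    win_len: int    # analysis window length in samples
    time_ms: float  # window duration, a coarse measure of time resolution
    freq_hz: float  # DFT bin spacing, a coarse measure of frequency resolution


def stft_resolution(sample_rate: int, win_len: int) -> Resolution:
    """Nominal time/frequency resolution of an STFT with the given window."""
    return Resolution(
        win_len=win_len,
        time_ms=1000.0 * win_len / sample_rate,
        freq_hz=sample_rate / win_len,
    )


def resolves(res: Resolution, spacing_hz: float) -> bool:
    """True if two components spacing_hz apart can land in different bins."""
    return res.freq_hz <= spacing_hz


# All three constants are assumptions chosen for illustration,
# not settings taken from the paper.
SAMPLE_RATE = 16_000               # Hz
WINDOW_LENGTHS = (128, 512, 2048)  # samples, one per hypothetical branch
F0_GAP_HZ = 20.0                   # e.g. talkers at 100 Hz and 120 Hz

branches = [stft_resolution(SAMPLE_RATE, n) for n in WINDOW_LENGTHS]

for b in branches:
    print(
        f"N={b.win_len:5d}  window={b.time_ms:6.1f} ms  "
        f"bin={b.freq_hz:7.2f} Hz  "
        f"separates {F0_GAP_HZ:.0f} Hz gap: {resolves(b, F0_GAP_HZ)}"
    )
```

With these assumed values, only the longest window has bins narrow enough to separate the two fundamentals, while the shortest window spans the least time and so tracks transients best. The product of window duration and bin spacing is the same for every branch, which is why no single window can be good at both and why combining branches of different resolution is attractive.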