Multi-Sound Source Separation Method Based on Multi-Resolution Parallel Feature Extraction

GAO Shang; JIA Maoshen

doi:10.12263/DZXB.20250937

PAPERS | Updated: 2026-06-04

- ACTA ELECTRONICA SINICA Vol. 54, Issue 1, Pages: 183-194 (2026)
- Author affiliation: School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
- Funding: National Natural Science Foundation of China (62471012)
- DOI: 10.12263/DZXB.20250937
- CLC: TN912.3
- Received: 19 October 2025; Accepted: 23 January 2026; Published: 25 January 2026
GAO Shang, JIA Maoshen. Multi-Sound Source Separation Method Based on Multi-Resolution Parallel Feature Extraction[J]. Acta Electronica Sinica, 2026, 54(01): 183-194. DOI：10.12263/DZXB.20250937

Abstract

With the rapid advancement of technologies such as the Internet of Everything, intelligent sensing, and human-machine interaction, multi-source separation in complex acoustic environments has become a crucial front-end problem in speech signal processing. However, non-stationary speech signals exhibit distinct energy distributions across different temporal and frequency scales, encompassing both rapidly changing formant structures and relatively stable harmonic and periodic information. Traditional single-resolution time-frequency analysis faces a fundamental constraint in such scenarios: short analysis windows yield insufficient frequency resolution, making it hard to distinguish the harmonic structures of multiple sources; conversely, longer windows degrade temporal resolution, making it hard to capture the rapidly changing transient features of speech. Consequently, existing fixed-resolution separation methods often suffer from blurred time-frequency structures, loss of speech detail, unbalanced separation, and distorted separated signals in real complex acoustic environments, which limits their robustness and practicality in real-world scenarios. To address these problems, the proposed method implements a multi-branch parallel deep neural network. Each branch independently processes time-frequency spectra generated with a different window length and employs nested hierarchical recurrent units for feature modeling. Specifically, each branch contains a two-stage recurrent module: a frequency-spatial modeling unit (Frequency Long Short-Term Memory, F-LSTM) that recurs along the frequency axis to extract cross-channel spatial correlations and spectral structure, and a time-spatial modeling unit (Time Long Short-Term Memory, T-LSTM) that recurs along the time axis to capture the long-term dynamic evolution and temporal dependencies of speech signals. Furthermore, multiple sets of time-frequency spectra, generated with different analysis windows and therefore having different resolutions, are fed into the network in parallel, so that the network can exploit their complementary time and frequency resolutions. During training, all branches are jointly optimized through a shared time-domain reconstruction loss, which encourages the network to learn consistent and complementary representations across resolutions. Each branch uses a nested structure to enhance the interaction and fusion of cross-resolution features. At the output stage, the complex spectral masks estimated by the branches are integrated by a fusion layer, and the time-domain signal is reconstructed through the inverse short-time Fourier transform, enabling end-to-end training under both time-domain and frequency-domain constraints. Through multi-resolution joint optimization, the model simultaneously captures transient details and periodic harmonic structures within the speech spectrogram. The proposed multi-resolution fusion scheme significantly improves both objective metrics and subjective listening quality in highly reverberant, multi-speaker environments. It is also structurally flexible and can be transferred to other network frameworks based on time-frequency analysis, providing a scalable design approach and methodological foundation for future multi-source separation models targeting complex acoustic fields.
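The window-length trade-off that motivates the multi-branch design can be made concrete with a small calculation. The sketch below is illustrative only: the 16 kHz sampling rate and the three window lengths are assumptions, not values reported in the paper, and `stft_resolution` is a hypothetical helper name. For a window of N samples at sampling rate fs, the window spans N/fs seconds and the DFT bins are spaced fs/N Hz apart, so improving one resolution necessarily worsens the other.

```python
def stft_resolution(sample_rate_hz, window_lengths):
    """Return (window_length, time_span_ms, bin_spacing_hz) for each window.

    time_span_ms   : duration covered by one analysis window (time resolution)
    bin_spacing_hz : spacing between adjacent DFT bins (frequency resolution)
    """
    rows = []
    for n in window_lengths:
        time_span_ms = 1000.0 * n / sample_rate_hz
        bin_spacing_hz = sample_rate_hz / n
        rows.append((n, time_span_ms, bin_spacing_hz))
    return rows


# Assumed values for illustration; the paper's actual settings are not given here.
SAMPLE_RATE_HZ = 16000
WINDOW_LENGTHS = (256, 512, 1024)

for n, span_ms, spacing_hz in stft_resolution(SAMPLE_RATE_HZ, WINDOW_LENGTHS):
    print(f"window={n:5d} samples  span={span_ms:5.1f} ms  "
          f"bin spacing={spacing_hz:7.3f} Hz")
```

Under these assumed settings, the 256-sample window covers 16 ms with 62.5 Hz bin spacing, while the 1024-sample window covers 64 ms with 15.625 Hz spacing. Feeding several such spectra to parallel branches is one way a network could see both fine temporal detail and fine harmonic detail at once.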
