Audio Scene Clustering Based on Joint Learning Framework

ZHANG Yu-han; LI Yan-xiong; JIANG Zhong-jie; CHEN Hao

doi:10.12263/DZXB.20200573

Academic paper | Updated: 2025-12-08

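- Audio Scene Clustering Based on Joint Learning Framework
- Acta Electronica Sinica, 2021, Vol. 49, No. 10, pp. 2041-2047
- Author affiliation:
  School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510640, China
- About the authors:
  ZHANG Yu-han, male, born in June 1995 in Huangshan, Anhui. He is currently a master's student at the School of Electronic and Information Engineering, South China University of Technology. His main research interests are speech and audio signal processing and machine learning.
  LI Yan-xiong (corresponding author), male, born in August 1980 in Jiahe, Hunan. He is currently an associate professor and doctoral supervisor at the School of Electronic and Information Engineering, South China University of Technology. His main research interests are speech and audio signal processing, machine learning, and pattern recognition. E-mail: eeyxli@scut.edu.cn
- Funding:
  National Natural Science Foundation of China (61771200)
- DOI: 10.12263/DZXB.20200573
  Chinese Library Classification number: TN912.3
- Received: 2020-06-15
  Revised: 2020-10-04
  Published in print: 2021-10-25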
ZHANG Yu-han, LI Yan-xiong, JIANG Zhong-jie, et al. Audio Scene Clustering Based on Joint Learning Framework[J]. ACTA ELECTRONICA SINICA, 2021, 49(10): 2041-2047. DOI: 10.12263/DZXB.20200573.

Abstract

Audio scene clustering (ASC) is the task of merging audio samples that belong to the same acoustic scene into a single cluster. This paper proposes an ASC method based on a joint learning framework. The framework consists of a convolution autoencoder network (CAN) and a discriminative clustering network (DCN). The CAN, which comprises an encoder and a decoder, extracts deep transformed features (DTF), while the DCN estimates cluster assignments from the input DTF to realize ASC. Two datasets, DCASE-2017 and LITIS-Rouen, are used as experimental data, and the performance of different features and clustering methods is compared. Experimental results show that, in terms of both normalized mutual information and clustering accuracy, the DTF extracted by the joint learning framework outperforms other features and the proposed method is superior to other methods. The cost of the proposed method is higher computational complexity.
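The two evaluation metrics named in the abstract, normalized mutual information (NMI) and clustering accuracy, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, the geometric-mean normalization of NMI is an assumption (the abstract does not say which normalization is used), and the best cluster-to-class mapping is found by brute force over permutations instead of the Hungarian algorithm.

```python
import math
from collections import Counter
from itertools import permutations


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = I(T; P) / sqrt(H(T) * H(P)), with natural logarithms.

    The geometric-mean normalization is an assumption of this sketch.
    """
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    count_tp = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(T; P) = sum over co-occurring pairs of p(t, p) * log(p(t, p) / (p(t) * p(p)))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in count_tp.items()
    )
    h_t, h_p = entropy(count_t), entropy(count_p)
    if h_t == 0.0 or h_p == 0.0:
        # Degenerate case (a single class or a single cluster); the value
        # returned here is a convention chosen for this sketch.
        return 1.0 if h_t == h_p else 0.0
    return mutual_info / math.sqrt(h_t * h_p)


def clustering_accuracy(labels_true, labels_pred):
    """Fraction of samples matched under the best one-to-one mapping
    between predicted clusters and true classes (brute-force search)."""
    n = len(labels_true)
    true_ids = list(dict.fromkeys(labels_true))
    pred_ids = list(dict.fromkeys(labels_pred))
    overlap = Counter(zip(labels_pred, labels_true))

    if len(pred_ids) <= len(true_ids):
        # Assign every predicted cluster to a distinct true class.
        best = max(
            sum(overlap[(p, t)] for p, t in zip(pred_ids, perm))
            for perm in permutations(true_ids, len(pred_ids))
        )
    else:
        # More clusters than classes: some clusters stay unmatched.
        best = max(
            sum(overlap[(p, t)] for p, t in zip(perm, true_ids))
            for perm in permutations(pred_ids, len(true_ids))
        )
    return best / n
```

For example, `clustering_accuracy([0, 0, 0, 1, 1, 2, 2, 2], [1, 1, 0, 0, 0, 2, 2, 2])` evaluates to `0.875`, because the best mapping matches 7 of the 8 samples. The permutation search is only practical for a handful of clusters; for larger label sets, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement.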
