Acta Electronica Sinica ›› 2021, Vol. 49 ›› Issue (10): 2041-2047. DOI: 10.12263/DZXB.20200573
ZHANG Yu-han, LI Yan-xiong, JIANG Zhong-jie, CHEN Hao

Received: 2020-06-15
Revised: 2020-10-04
Online: 2021-11-29
Published: 2021-10-25
Abstract:
The task of audio scene clustering is to group audio samples belonging to the same audio scene into one class. This paper proposes an audio scene clustering method based on a joint learning framework. The framework consists of a Convolutional Autoencoder Network (CAN) and a Discriminative Clustering Network (DCN). The CAN, comprising an encoder and a decoder, extracts deep transform features; the DCN estimates the class of each input deep transform feature, thereby performing audio scene clustering. Experiments on the DCASE-2017 and LITIS-Rouen datasets compare the performance of different features and clustering methods. With normalized mutual information (NMI) and clustering accuracy (CA) as evaluation metrics, the results show that the deep transform features extracted by the joint learning framework outperform the other features, and that the proposed method outperforms the other methods. The price paid by the proposed method is its higher computational complexity.
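The two evaluation metrics used above, NMI and CA, can be sketched in pure Python. This is a minimal illustration and not the authors' code: NMI is computed here with the geometric-mean normalization (one common choice), and CA maps predicted clusters to ground-truth classes by brute-force permutation, which is practical only for a handful of clusters — with the 15 to 19 classes of these datasets one would use the Hungarian algorithm instead.

```python
import math
from collections import Counter
from itertools import permutations

def nmi(labels_true, labels_pred):
    """Normalized mutual information with geometric-mean normalization."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    h_t = -sum(c / n * math.log(c / n) for c in ct.values())
    h_p = -sum(c / n * math.log(c / n) for c in cp.values())
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    denom = math.sqrt(h_t * h_p)
    return mi / denom if denom > 0 else 0.0

def clustering_accuracy(labels_true, labels_pred):
    """Best accuracy over one-to-one cluster-to-class mappings.

    Brute force over permutations; assumes the number of clusters
    does not exceed the number of classes.
    """
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t
                   for t, p in zip(labels_true, labels_pred))
        best = max(best, hits)
    return best / len(labels_true)
```

Both metrics are invariant to relabeling of the clusters: a perfect clustering scores 1.0 on each even when the cluster indices differ from the class indices.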
ZHANG Yu-han, LI Yan-xiong, JIANG Zhong-jie, CHEN Hao. Audio Scene Clustering Based on Joint Learning Framework[J]. Acta Electronica Sinica, 2021, 49(10): 2041-2047.
Table 1 Experimental datasets

| Sample parameter | DCASE-2017 | LITIS-Rouen |
| --- | --- | --- |
| Number of audio scene classes | 15 | 19 |
| Total number of samples | 4680 | 3026 |
| Average sample duration | 10 s | 30 s |
Table 2 Audio scene clustering performance of different features

| Feature | DCASE-2017 NMI/% | DCASE-2017 CA/% | LITIS-Rouen NMI/% | LITIS-Rouen CA/% |
| --- | --- | --- | --- | --- |
| DTF | 61.66 | 52.83 | 58.57 | 50.25 |
| MFCC | 47.23 | 43.10 | 44.33 | 42.90 |
| LMS | 57.02 | 48.01 | 56.03 | 48.51 |
| Gabor | 46.86 | 45.65 | 45.20 | 44.58 |
Table 3 Effect of different initial clustering algorithms

| Initial clustering algorithm | DCASE-2017 NMI/% | DCASE-2017 CA/% | LITIS-Rouen NMI/% | LITIS-Rouen CA/% |
| --- | --- | --- | --- | --- |
| SC | 61.31 | 50.68 | 55.50 | 48.87 |
| K-means | 62.51 | 54.07 | 52.31 | 45.90 |
| BIRCH | 67.12 | 56.54 | 60.30 | 55.68 |
| GMM | 64.96 | 55.68 | 56.78 | 50.13 |
| AHC | 66.20 | 54.26 | 58.53 | 50.30 |
| Random | 60.11 | 49.52 | 50.46 | 42.87 |
Table 4 Performance comparison of different clustering methods

| Clustering method | DCASE-2017 NMI/% | DCASE-2017 CA/% | DCASE-2017 time/s | LITIS-Rouen NMI/% | LITIS-Rouen CA/% | LITIS-Rouen time/s |
| --- | --- | --- | --- | --- | --- | --- |
| Proposed method | 67.12 | 56.54 | 1474 | 60.30 | 55.68 | 983 |
| DTF+AHC | 61.66 | 52.83 | 1008 | 58.57 | 50.25 | 733 |
| LMS+SC | 57.44 | 50.68 | 153 | 52.86 | 48.88 | 62 |
| LMS+K-means | 53.44 | 48.33 | 52 | 40.64 | 35.83 | 49 |
| LMS+BIRCH | 60.77 | 51.42 | 78 | 55.83 | 42.63 | 40 |
| LMS+GMM | 59.19 | 50.30 | 188 | 53.45 | 42.10 | 96 |
| LMS+AHC | 55.98 | 41.23 | 132 | 49.16 | 41.90 | 54 |
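Table 3 varies the algorithm that produces the initial cluster assignments used to start the joint framework's training; BIRCH gave the best results on both datasets. As a minimal illustration of how an initial clusterer yields pseudo-labels from feature vectors, here is a tiny pure-Python K-means sketch (one of the initializers compared above; an assumption for illustration, not the authors' implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means over tuples; returns one pseudo-label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        # (squared Euclidean distance)
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels
```

In the pipeline described by the abstract, such pseudo-labels would serve as the initial targets for the DCN, which then refines them jointly with the CAN's features; per Table 3, BIRCH rather than K-means was the strongest choice for this initialization.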