Acta Electronica Sinica ›› 2021, Vol. 49 ›› Issue (10): 2041-2047. DOI: 10.12263/DZXB.20200573

• Research Paper •

Audio Scene Clustering Based on Joint Learning Framework

ZHANG Yu-han, LI Yan-xiong, JIANG Zhong-jie, CHEN Hao

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510640, China
  • Received: 2020-06-15 Revised: 2020-10-04 Online: 2021-11-29 Published: 2021-10-25
    • About the authors:
    • ZHANG Yu-han, male, born in June 1995 in Huangshan, Anhui. He is a master's student at the School of Electronic and Information Engineering, South China University of Technology. His research interests include speech and audio signal processing and machine learning.
      LI Yan-xiong (corresponding author), male, born in August 1980 in Jiahe, Hunan. He is an associate professor and doctoral supervisor at the School of Electronic and Information Engineering, South China University of Technology. His research interests include speech and audio signal processing, machine learning, and pattern recognition. E-mail: eeyxli@scut.edu.cn
    • Supported by:
    • National Natural Science Foundation of China (61771200)

Abstract:

Audio scene clustering (ASC) is the task of grouping audio samples that belong to the same acoustic scene into a single cluster. This paper proposes an ASC method based on a joint learning framework. The framework consists of a convolutional autoencoder network (CAN) and a discriminative clustering network (DCN). The CAN, which comprises an encoder and a decoder, extracts deep transformed features (DTF); the DCN estimates the cluster of each input DTF, thereby performing ASC. Two datasets, DCASE-2017 and LITIS-Rouen, serve as experimental data for comparing the performance of different features and clustering methods. Experimental results show that, in terms of both normalized mutual information and clustering accuracy, the DTF extracted by the joint learning framework outperforms the other features, and the proposed method outperforms the other methods. The cost of the proposed method is higher computational complexity.

Key words: audio scene clustering, joint learning framework, convolutional autoencoder network, discriminative clustering network
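
The abstract names two evaluation metrics, normalized mutual information (NMI) and clustering accuracy, without giving their formulas. The sketch below shows one common way to compute both from ground-truth scene labels and predicted cluster labels. It is not the authors' evaluation code. The function names are hypothetical, the geometric-mean normalizer for NMI is an assumption (other normalizers exist), and the cluster-to-class mapping is found by brute force over permutations where the Hungarian algorithm would normally be used.

```python
import math
from collections import Counter
from itertools import permutations


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, pred_labels):
    """NMI = I(T; P) / sqrt(H(T) * H(P)).

    Assumption: geometric-mean normalizer; the abstract does not say
    which normalizer the paper uses.
    """
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
    denom = math.sqrt(entropy(true_labels) * entropy(pred_labels))
    return mi / denom if denom > 0 else 0.0


def clustering_accuracy(true_labels, pred_labels):
    """Best accuracy over one-to-one mappings from clusters to classes.

    Brute force over permutations, so only practical for a handful of
    clusters; this stands in for the Hungarian algorithm.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad with None so every cluster can be assigned something,
    # even when there are more clusters than classes.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    joint = Counter(zip(pred_labels, true_labels))
    best = 0
    for perm in permutations(padded, len(clusters)):
        hits = sum(joint[(k, c)] for k, c in zip(clusters, perm))
        best = max(best, hits)
    return best / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]   # hypothetical scene labels
    guess = [1, 1, 0, 0, 2, 2]   # same partition, different cluster ids
    print(normalized_mutual_information(truth, guess))
    print(clustering_accuracy(truth, guess))
```

Both metrics ignore the arbitrary numbering of clusters, so a prediction that recovers the true partition under different cluster ids scores 1.0 on each (up to floating-point rounding).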

CLC number: