电子学报 ›› 2019, Vol. 47 ›› Issue (5): 1000-1008.DOI: 10.3969/j.issn.0372-2112.2019.05.004

• 学术论文 • 上一篇    下一篇

完全自适应的谱聚类算法

谢娟英, 丁丽娟   

  1. 陕西师范大学计算机科学学院, 陕西西安 710062
  • 收稿日期:2018-02-05 修回日期:2018-07-22 出版日期:2019-05-25
    • 作者简介:
    • 谢娟英 女,1971生于陕西西安,陕西师范大学计算机科学学院教授,主要研究领域为机器学习、数据挖掘和生物医学大数据分析.E-mail:xiejuany@snnu.edu.cn;丁丽娟 女,1994生于陕西安康,陕西师范大学计算机科学学院硕士研究生,主要研究领域为机器学习、数据挖掘.E-mail:lijuan_ding@163.com
    • 基金资助:
    • 国家自然科学基金 (No.61673251); 国家重点研发计划 (No.2016YFC0901900); 中央高校基本科研业务费专项资金 (No.GK2017010006); 研究生培养创新基金 (No.2015CX028,No.2016CSY009)

The True Self-adaptive Spectral Clustering Algorithms

XIE Juan-ying, DING Li-juan   

  1. School of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710062, China
  • Received:2018-02-05 Revised:2018-07-22 Online:2019-05-25 Published:2019-05-25
    • Supported by:
    • National Natural Science Foundation of China (No.61673251); National Key Research and Development Program of China (No.2016YFC0901900); Fundamental Research Funds for the Central Universities (No.GK2017010006); Postgraduate Cultivation Innovation Fund (No.2015CX028, No.2016CSY009)

摘要: 针对谱聚类算法self-tuning的局部尺度参数σi会受噪音点影响,进而影响聚类结果,及其所使用的K-means算法的不稳定,对聚类结果的影响,提出两种完全自适应的谱聚类算法SC_SD(Spectral Clustering based on Standard Deviation)和SC_MD(Spectral Clustering based on Mean Distance),分别定义样本i的标准差、样本i到其余样本的距离均值,为样本i的邻域半径,统计邻域内的样本数,以样本i的邻域标准差为其局部尺度参数,避免样本i的局部尺度参数受噪音点影响,进而影响聚类结果;以方差优化初始聚类中心的SD_K-medoids算法代替K-means算法,克服K-means算法的不稳定,发现数据的真实分布.UCI数据集和人工数据集实验测试表明,提出的SC_SD和SC_MD算法能得到更优聚类结果,不受噪音点影响,有很好的伸缩性.提出的SC_SD和SC_MD能完全自适应地发现数据集的真实分布信息,尤其SC_MD算法很适合较大规模数据集的聚类分析.

关键词: 谱聚类, 邻域, 标准差, 均值, 自适应

Abstract: To avoid the clustering results with the local scaling parameter σi of self-tuning may be influenced by outliers,and the unstable clustering results from K-means in self-tuning,two true self-adaptive spectral clustering algorithms were proposed.The two spectral clustering algorithms are respectively named as SC_SD(Spectral Clustering based on Standard Deviation) and SC_MD(Spectral Clustering based on Mean Distance).They respectively define the standard deviation of point i,and the mean distance from point i to others,as its radius of neighborhood,then count the number of points in the neighborhood,and use the standard deviation of point i in the neighborhood as its local scaling parameter,so as to avoid the influence from outliers to the local scaling parameter σi of point i,and the distortion in clustering results of self-tuning.SD_K-medoids are adopted to instead of K-means in self-tuning to avoid the unstable clustering results of K-means,so as to get the true clustering of a dataset.The experimental results on UCI datasets and on synthetic datasets demonstrate that SC_SD and SC_MD can obtain better clustering results than that of traditional spectral clustering algorithm NJW and spectral clustering algorithm self-tuning,and are robust to noises,and has got good scalability.The proposed SC_SD and SC_MD can detect the clustering of a dataset without any given information,and the SC_MD can be used to detect the clustering of a comparable big data.

Key words: spectral clustering, neighborhood, standard deviation, mean distance, self-adaption

中图分类号: