Abstract:The local scale parameter σi of the self-tuning spectral clustering algorithm is sensitive to outliers, and the K-means step it relies on is unstable; both defects distort the clustering results. To overcome them, two fully self-adaptive spectral clustering algorithms are proposed: SC_SD (Spectral Clustering based on Standard Deviation) and SC_MD (Spectral Clustering based on Mean Distance). SC_SD takes the standard deviation of sample i, and SC_MD the mean distance from sample i to all other samples, as the neighborhood radius of sample i; the samples falling inside this radius are counted, and the standard deviation of sample i within its neighborhood is used as its local scale parameter, so that σi is no longer affected by outliers and the clustering results are no longer distorted by them. The SD_K-medoids algorithm, which optimizes the initial cluster centers by variance, replaces K-means to overcome the instability of K-means and to uncover the true distribution of the data. Experiments on UCI datasets and on synthetic datasets demonstrate that SC_SD and SC_MD obtain better clustering results than the classical NJW algorithm and the self-tuning spectral clustering algorithm, are robust to outliers, and scale well. Both proposed algorithms detect the true cluster structure of a dataset fully adaptively, without any prior information; SC_MD in particular is well suited to the clustering analysis of comparatively large datasets.
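The abstract only sketches how the per-sample scale σi and the affinity matrix are built. Below is a minimal NumPy reconstruction of one plausible reading: the neighborhood radius of sample i is the standard deviation (SC_SD) or the mean (SC_MD) of its distances to the other samples, and σi is the standard deviation of the distances inside that radius. The function names, the fallback for near-empty neighborhoods, and the omission of the final spectral embedding and SD_K-medoids step are our assumptions, not the authors' code.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix (n x n) via broadcasting."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def local_scales(X, mode="mean"):
    """Per-sample scale sigma_i as described (in outline) in the abstract.
    mode="sd"   -> radius_i = std of distances from i (SC_SD reading)
    mode="mean" -> radius_i = mean of distances from i (SC_MD reading)
    sigma_i is the std of the distances that fall inside the radius;
    when the neighborhood is too small we fall back to the row std
    (an assumption, chosen here only to keep sigma_i positive)."""
    D = pairwise_distances(X)
    n = X.shape[0]
    sigma = np.empty(n)
    for i in range(n):
        d = np.delete(D[i], i)                       # distances to the others
        radius = d.std() if mode == "sd" else d.mean()
        neigh = d[d <= radius]                       # samples inside the radius
        sigma[i] = neigh.std() if neigh.size > 1 else d.std()
    return sigma

def affinity_matrix(X, mode="mean"):
    """Self-tuning-style affinity W_ij = exp(-d_ij^2 / (sigma_i * sigma_j)),
    with the proposed outlier-insensitive sigma_i plugged in."""
    D = pairwise_distances(X)
    s = local_scales(X, mode)
    W = np.exp(-(D ** 2) / (s[:, None] * s[None, :]))
    np.fill_diagonal(W, 0.0)                         # no self-affinity
    return W
```

From `W` the usual NJW pipeline would follow: normalized Laplacian, top-k eigenvectors, then SD_K-medoids (instead of K-means) on the row-normalized embedding.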
[1] Xie J Y,Jiang S,Xie W X,et al.An efficient global K-means clustering algorithm[J].Journal of Computers,2011,6(2):271-279.
[2] Han J W,Kamber M.Data Mining:Concepts and Techniques[M].Beijing:China Machine Press,2006.83-84.
[3] Sajana T,Rani C M S,Narayana K V.A survey on clustering techniques for big data mining[J].Indian Journal of Science and Technology,2016,9(3):1-12.
[4] Rai P,Dwivedi R K.Clustering techniques for unsupervised learning[J].International Journal of Management,IT and Engineering,2012,2(11):462-571.
[5] Macqueen J.Some methods for classification and analysis of multivariate observations[A].Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability[C].Berkeley:University of California Press,1967.281-297.
[6] Fu G Y.Optimization methods for fuzzy clustering[J].Fuzzy Sets and Systems,1998,93(3):301-309.
[7] Kaufman L,Rousseeuw P J.Finding Groups in Data:An Introduction to Cluster Analysis[M].New York,USA:John Wiley & Sons,1990.126-163.
[8] Luxburg U V.A tutorial on spectral clustering[J].Statistics and Computing,2007,17(4):395-416.
[9] Perona P,Freenman W.A factorization approach to grouping[A].European Conference on Computer Vision[C].Berlin:Springer,1998.655-670.
[10] Shi J,Malik J.Normalized cuts and image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905.
[11] Ng A Y,Jordan M I,Weiss Y.On spectral clustering:analysis and an algorithm[A].Advances in Neural Information Processing Systems(NIPS14)[C].Cambridge MA:MIT Press,2002.849-856.
[12] Zelnik-Manor L,Perona P.Self-tuning spectral clustering[A].Advances in Neural Information Processing Systems(NIPS17)[C].Cambridge MA:MIT Press,2005.1601-1608.
[13] WANG Ling,BO Lie-feng,JIAO Li-cheng.Density-sensitive spectral clustering[J].Acta Electronica Sinica,2007,35(8):1577-1581.(in Chinese)
[14] WANG Na,LI Xia.Active semi-supervised spectral clustering based on pairwise constraints[J].Acta Electronica Sinica,2010,38(1):172-176.(in Chinese)
[15] Xie J Y,Zhou Y,Ding L J.Local standard deviation spectral clustering[A].2018 IEEE International Conference on Big Data and Smart Computing(BigComp)[C].Piscataway:IEEE,2018.242-250.
[16] XIE Juan-ying,GAO Rui.K-medoids clustering algorithms with optimized initial seeds by variance[J].Journal of Frontiers of Computer Science and Technology,2015,9(8):973-984.(in Chinese)
[17] Blake C,Merz C J.UCI Repository of machine learning databases[DB/OL].http://www.ics.uci.edu/~mlearn/MLRepository.html,1998-04-02.
[18] Vinh N X,Epps J,Bailey J.Information theoretic measures for clusterings comparison:Variants,properties,normalization and correction for chance[J].Journal of Machine Learning Research,2010,11(Oct):2837-2854.
[19] Hubert L,Arabie P.Comparing partitions[J].Journal of Classification,1985,2(1):193-218.
[20] BU Dong-bo,BAI Shuo.Principle of granularity in clustering and classification[J].Chinese Journal of Computers,2002,25(8):810-816.(in Chinese)
[21] Chang H,Yeung D Y.Robust path-based spectral clustering[J].Pattern Recognition,2008,41(1):191-203.
[22] Fränti P,Virmajoki O.Iterative shrinking method for clustering problems[J].Pattern Recognition,2006,39(5):761-775.