Abstract
To address the low efficiency of traditional serial clustering ensemble algorithms on high-dimensional, massive data, we propose SCEA (Spark based Clustering Ensemble Algorithm), a parallel clustering ensemble algorithm built on the Spark platform. First, the input data are preprocessed by combining principal component analysis with pairwise constraints, which reduces the dimensionality of the data and removes correlation among features. Second, after base clustering members are obtained by calling different clustering algorithms, a similarity matrix is constructed from their cluster labels using the triple method, and a hierarchical clustering algorithm is applied to obtain the final clustering result. Finally, SCEA is implemented in Scala on top of the clustering algorithms already available in Spark MLlib. SCEA is compared with similar algorithms on multiple data sets. The experimental results show that, overall, SCEA improves accuracy over existing algorithms, and an analysis of three performance metrics (running time, speedup, and scalability) demonstrates its superior performance.
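The consensus step described in the abstract (a similarity matrix built from base-clustering labels, followed by hierarchical clustering) can be illustrated with a minimal single-machine sketch. It is written in plain Python rather than Spark/Scala and is not the SCEA implementation. The abstract does not detail the triple method, so the sketch assumes a standard co-association matrix (the fraction of base clusterings in which two points share a cluster) and average-linkage agglomeration. The function names `co_association` and `average_linkage` and the toy labelings are hypothetical. The PCA and pairwise-constraint preprocessing is not covered.

```python
from itertools import combinations


def co_association(base_labelings):
    """Return an n x n matrix whose (i, j) entry is the fraction of base
    clusterings that put points i and j in the same cluster.
    Assumption: this stands in for the paper's similarity matrix."""
    m = len(base_labelings)
    n = len(base_labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in base_labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def average_linkage(sim, k):
    """Agglomerative clustering on a similarity matrix: repeatedly merge
    the two clusters with the highest average pairwise similarity until
    only k clusters remain."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        best_pair, best_score = None, -1.0
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
            score = total / (len(clusters[a]) * len(clusters[b]))
            if score > best_score:
                best_pair, best_score = (a, b), score
        a, b = best_pair  # a < b, so deleting index b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy base clusterings of six points (made-up labels for illustration).
# The first two describe the same partition under different label names,
# which the co-association matrix treats identically.
base_labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
]

sim = co_association(base_labelings)
consensus = average_linkage(sim, k=2)
print(consensus)
```

Because the matrix depends only on whether two points share a label, base clusterings produced by different algorithms can be combined without aligning their label names. The pairwise loops here are quadratic in the number of points, which is the kind of cost a parallel implementation would need to distribute.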