SCEA: A Parallel Clustering Ensemble Algorithm for High-Dimensional Massive Data

廖彬; 黄静莱; 王鑫; 孙瑞娜; 葛晓燕; 国冰磊

doi:10.12263/DZXB.20191362


Updated: 2025-12-08

- Acta Electronica Sinica Vol. 49, Issue 6, Pages: 1077-1087(2021)
- DOI: 10.12263/DZXB.20191362
  CLC: TP301.6
- Published: 2021

廖彬, 黄静莱, 王鑫, et al. SCEA: A Parallel Clustering Ensemble Algorithm for High-Dimensional Massive Data[J]. Acta Electronica Sinica, 2021, 49(6): 1077-1087. DOI: 10.12263/DZXB.20191362.

Abstract

Traditional serial clustering ensemble algorithms are inefficient on high-dimensional massive data. To address this, we propose SCEA (Spark based Clustering Ensemble Algorithm), a parallel clustering ensemble algorithm built on the Spark platform. First, the input data are preprocessed by combining principal component analysis with pairwise constraints, which reduces the dimensionality of the data and removes feature correlation. Second, after the base clustering members are obtained by calling different clustering algorithms, a similarity matrix is constructed from the cluster labels of the base clustering members using the triple method, and a hierarchical clustering algorithm is applied to obtain the final clustering result. Finally, SCEA is implemented in Scala on top of the clustering algorithms already available in Spark MLlib. SCEA is compared with similar algorithms on multiple data sets. The experimental results show that, overall, SCEA improves accuracy over existing algorithms, and an analysis of three performance indexes (running time, speedup ratio, and scalability) demonstrates its superior performance.
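The ensemble step described in the abstract (cluster labels of the base members, then a similarity matrix, then hierarchical clustering) can be illustrated with a small serial sketch. This is not the authors' Spark/Scala implementation, and the abstract does not spell out the triple method. The sketch assumes the common co-association similarity (the fraction of base clusterings that place two points in the same cluster), stores it as `(i, j) -> value` entries, and uses average linkage for the hierarchical step. The function names `co_association`, `average_linkage`, and `ensemble` are hypothetical.

```python
from itertools import combinations


def co_association(base_labels):
    """Return {(i, j): similarity} for every pair i < j.

    base_labels holds one label list per base clustering. The similarity
    of two points is the fraction of base clusterings that assign them
    the same label (an assumed stand-in for the paper's triple method).
    """
    n = len(base_labels[0])
    m = len(base_labels)
    sim = {}
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in base_labels if labels[i] == labels[j])
        sim[(i, j)] = agree / m
    return sim


def average_linkage(sim, n, k):
    """Agglomerative clustering: merge the most similar pair until k remain."""
    clusters = [[i] for i in range(n)]

    def pair_sim(a, b):
        # Mean pairwise similarity between the members of clusters a and b.
        total = sum(sim[(min(x, y), max(x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # combinations yields p < q, so deleting index q leaves p valid.
        p, q = max(
            combinations(range(len(clusters)), 2),
            key=lambda pq: pair_sim(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * n
    for cluster_id, members in enumerate(clusters):
        for idx in members:
            labels[idx] = cluster_id
    return labels


def ensemble(base_labels, k):
    """Combine several base clusterings into one consensus labeling."""
    n = len(base_labels[0])
    return average_linkage(co_association(base_labels), n, k)


if __name__ == "__main__":
    # Three base clusterings of six points; label values need not match
    # across clusterings, only the grouping matters.
    base = [
        [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 2, 2],
    ]
    print(ensemble(base, 2))  # [0, 0, 0, 1, 1, 1]
```

The pairwise loop is quadratic in the number of points, which is the cost a parallel implementation would need to distribute. The sketch keeps everything in memory to show only the logic of the consensus step.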
