Acta Electronica Sinica ›› 2022, Vol. 50 ›› Issue (11): 2730-2737. DOI: 10.12263/DZXB.20211273

• Research Article •

Density Peaks Clustering Algorithm with Geodesic Distance and Cosine Mutual Reverse Nearest Neighbors for Manifold Data

ZHAO Jia, WANG Gang, LÜ Li, FAN Tang-huai

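  1. School of Information Engineering, Nanchang Institute of Technology, Nanchang, Jiangxi 330099
  • Received: 2021-09-17 Revised: 2022-01-01 Publication date: 2022-11-25
    • About the authors:
    • ZHAO Jia, male, was born in September 1981 in Jiujiang, Jiangxi Province. He is a professor and master's supervisor at Nanchang Institute of Technology. His main research interests are machine learning, data mining, and intelligent computing. E-mail: zhaojia925@163.com
      WANG Gang, male, was born in August 1995 in Ganzhou, Jiangxi Province. He is a master's student at Nanchang Institute of Technology. His main research interests are machine learning and data mining. E-mail: wang691630202@163.com
      LÜ Li, female, was born in May 1982 in Guixi, Jiangxi Province. She is a professor and master's supervisor at Nanchang Institute of Technology. Her main research interests are big data analysis and object tracking. E-mail: lvli@nit.edu.cn
      FAN Tang-huai, male, was born in November 1962 in Jiujiang, Jiangxi Province. He is a professor and master's supervisor at Nanchang Institute of Technology. His main research interests are sensor information acquisition and processing, machine learning, and data mining. E-mail: fantanghuai@163.com
    • Funding:
    • National Natural Science Foundation of China (52069014); Jiangxi Province Key Research and Development Program (20192BBE50076)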

Density Peaks Clustering Algorithm Based on Geodesic Distance and Cosine Mutual Reverse Nearest Neighbors for Manifold Datasets

ZHAO Jia, WANG Gang, LÜ Li, FAN Tang-huai   

  1. School of Information Engineering, Nanchang Institute of Technology, Nanchang, Jiangxi 330099, China
  • Received: 2021-09-17 Revised: 2022-01-01 Online: 2022-11-25 Published: 2022-11-19

Abstract:

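The density peaks clustering algorithm tends to select density peaks in spherically distributed data, whereas manifold data are mostly non-spherical, so the cluster centers cannot be located accurately. The algorithm's allocation strategy gives priority to assigning samples near the cluster centers in a chained manner, whereas many samples in manifold data lie far from their cluster centers, so samples that should belong to the same cluster are misassigned. To address this, this paper proposes a density peaks clustering algorithm with geodesic distance and cosine mutual reverse nearest neighbors for manifold data. K-nearest neighbors are combined with geodesic distance to redefine the local density, which accentuates the difference between density peaks and non-peaks so that the cluster centers are found accurately. Mutual reverse nearest neighbors are combined with cosine similarity to obtain a sample similarity matrix based on cosine mutual reverse nearest neighbors, which assigns samples to manifold clusters accurately. Experimental results show that the algorithm effectively discovers the geometric shape of manifold datasets and clusters them accurately, and that it performs very well on real-world datasets and image datasets.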
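
As a rough illustration of the first ingredient, the sketch below approximates geodesic distance by shortest paths over a K-nearest-neighbor graph and derives a neighbor-based density from it. The abstract does not give the paper's formulas, so the function names and the density `sum(exp(-d))` over the K geodesically nearest samples are assumptions made for illustration only, not the authors' definition.

```python
import heapq
import math


def knn_graph(points, k):
    """Undirected graph linking each point to its k Euclidean nearest neighbors."""
    n = len(points)
    graph = [dict() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            d = math.dist(points[i], points[j])
            graph[i][j] = d
            graph[j][i] = d  # symmetrize so paths can be walked both ways
    return graph


def geodesic_from(graph, source):
    """Dijkstra shortest-path lengths from `source`, used as geodesic distances."""
    dist = [math.inf] * len(graph)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def knn_density(geo_row, k):
    """Hypothetical density: sum of exp(-d) over the k geodesically nearest samples."""
    nearest = sorted(d for d in geo_row if d > 0)[:k]
    return sum(math.exp(-d) for d in nearest)


if __name__ == "__main__":
    # A "C"-shaped arc: the two ends are close in Euclidean terms
    # but far apart when measured along the curve.
    n = 30
    arc = [(math.cos(i * 1.8 * math.pi / (n - 1)),
            math.sin(i * 1.8 * math.pi / (n - 1))) for i in range(n)]
    g = knn_graph(arc, k=2)
    geo0 = geodesic_from(g, 0)
    print("Euclidean end-to-end:", math.dist(arc[0], arc[-1]))
    print("Geodesic end-to-end: ", geo0[-1])
    print("Density at end / middle:",
          knn_density(geo0, 2), knn_density(geodesic_from(g, n // 2), 2))
```

On the arc, the path-based distance between the two ends is several times larger than the straight-line distance, which is the property that makes geodesic distance attractive for non-spherical clusters.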

Keywords: density peaks, clustering, K-nearest neighbors, mutual reverse nearest neighbors, local density, allocation strategy

Abstract:

The density peaks clustering algorithm tends to select density peaks in spherically distributed data, whereas manifold data are mostly non-spherically distributed, so the algorithm cannot accurately find the cluster centers. Its allocation strategy gives priority to the chain allocation of samples near the cluster centers, whereas a large number of samples in manifold data are far from their cluster centers, so samples that should belong to the same cluster are wrongly allocated. Therefore, this paper proposes a density peaks clustering algorithm based on geodesic distance and cosine mutual reverse nearest neighbors for manifold datasets. By combining K-nearest neighbors with geodesic distance to redefine the local density, the algorithm highlights the difference between density peaks and non-peaks and accurately finds the cluster centers. By combining mutual reverse nearest neighbors with cosine similarity, it obtains a sample similarity matrix based on cosine mutual reverse nearest neighbors, which accurately allocates samples to manifold clusters. The experimental results show that the algorithm effectively finds the geometric structure of manifold datasets and clusters them accurately, and that it achieves excellent clustering results on real-world datasets and image datasets.
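
The second ingredient, a similarity matrix built from mutual neighbors and cosine similarity, can be sketched as follows. The abstract does not define how the two are combined, so this is only one plausible reading: a pair of samples gets a non-zero similarity only when each is among the other's K nearest neighbors, and that similarity is the cosine of their feature vectors. All function names here are hypothetical.

```python
import math


def knn_sets(points, k):
    """For each point, the set of indices of its k Euclidean nearest neighbors."""
    n = len(points)
    result = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        result.append(set(order[:k]))
    return result


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is the zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mutual_cosine_similarity(points, k):
    """Assumed rule: sim[i][j] is the cosine similarity when i and j are
    in each other's k-nearest-neighbor sets, and 0.0 otherwise."""
    neighbors = knn_sets(points, k)
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:  # neighbor relation holds in both directions
                sim[i][j] = cosine_similarity(points[i], points[j])
    return sim


if __name__ == "__main__":
    samples = [(1.0, 0.0), (1.0, 0.1), (5.0, 5.0), (5.0, 5.2), (2.0, 0.0)]
    for row in mutual_cosine_similarity(samples, k=1):
        print([round(s, 3) for s in row])
```

Requiring the neighbor relation to hold in both directions keeps a sample from being pulled toward a cluster that does not count it among its own neighbors, which matches the motivation the abstract gives for the revised allocation strategy.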

Key words: density peaks, clustering, K-nearest neighbors, mutual reverse nearest neighbors, local density, allocation strategy

CLC number: