Acta Electronica Sinica ›› 2020, Vol. 48 ›› Issue (1): 44-58. DOI: 10.3969/j.issn.0372-2112.2020.01.006

Special Topic: Machine Learning - Feature Selection

• Research Article •

  • About the author: FANG Jia-yan, female, born in 1997 in Chizhou, Anhui. She is an undergraduate student at the School of Information and Software Engineering, University of Electronic Science and Technology of China. Her research interests include machine learning, artificial intelligence, and computational complexity theory. E-mail: fyk80@163.com
  • Funding:
    National Natural Science Foundation of China (No.61772117); "13th Five-Year Plan" Equipment Pre-Research Field Foundation (No.6140312010203); Frontier Exploration Project of the Science and Technology Commission of the Central Military Commission (No.1816321TS00105301); Sichuan Science and Technology Service Industry Demonstration Project (No.2018GFW0150); Open Project of the 54th Research Institute of CETC (No.185686)

Iterative Tighter Nonparallel Hyperplane Support Vector Clustering with Simultaneous Feature Selection

FANG Jia-yan1,2, LIU Qiao1,2   

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China;
    2. CETC Big Data Research Institute Co., Ltd., Guiyang, Guizhou 550022, China
  • Received: 2018-04-09; Revised: 2019-04-18; Online: 2020-01-25; Published: 2020-01-25


Abstract: In this paper, a new clustering algorithm with simultaneous feature selection is proposed, called iterative tighter nonparallel hyperplane support vector clustering with simultaneous feature selection (IT-NHSVC-SFS). In a learning model with two nonparallel hyperplanes, an iterative (alternating) optimization algorithm is used to perform clustering, and two types of regularizers are introduced: the Euclidean norm and the infinity norm. The Euclidean norm improves the generalization ability of the clustering model, while the infinity norm in effect performs implicit feature selection simultaneously for the two nonparallel hyperplanes, reducing noise from irrelevant features and thereby guaranteeing the clustering precision of the model. A set of bounding variables is also introduced to avoid the maximization operation over the infinity norm, converting the non-convex optimization problem into a convex quadratic optimization problem. Meanwhile, because the new model embodies the idea of the "maximum margin", it has good generalization ability. IT-NHSVC-SFS adopts the nonparallel hyperplane SVM (NHSVM) as its base model; unlike TWSVM and its variant models, only one quadratic programming (QP) problem needs to be solved to obtain the two optimal hyperplanes simultaneously. This property makes it convenient to design a synchronized feature selection process for the two nonparallel hyperplanes. The new algorithm also adds two sets of equality constraints to the constraint set of the original NHSVM model, which avoids the inversion of two large matrices and reduces the computational complexity. In addition, in the IT-NHSVC-SFS model, the Laplacian loss function replaces the original hinge loss function of NHSVM to avoid premature convergence. Numerical experiments on a set of benchmark datasets show that the IT-NHSVC-SFS algorithm performs better in terms of clustering accuracy than other existing clustering algorithms.
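The bounding-variable trick mentioned in the abstract — replacing a non-smooth infinity-norm term with one auxiliary variable and linear constraints — can be illustrated on a toy problem. The sketch below is not the paper's full clustering QP; it only shows, under a made-up single equality constraint, how minimizing ||w||_∞ reduces to a standard linear program via the constraint pair w_j - a ≤ 0 and -w_j - a ≤ 0:

```python
# Illustrative sketch of the bounding-variable reformulation of the
# infinity norm (toy problem, not the authors' full clustering model).
import numpy as np
from scipy.optimize import linprog

# Toy problem: find w in R^3 with minimum infinity norm subject to
# the single linear constraint w1 + w2 + w3 = 3.
n = 3

# Decision vector z = [w_1, ..., w_n, a], where a bounds max_j |w_j|.
c = np.zeros(n + 1)
c[-1] = 1.0  # minimize only the bounding variable a

# |w_j| <= a  becomes the pair  w_j - a <= 0  and  -w_j - a <= 0.
A_ub = np.zeros((2 * n, n + 1))
for j in range(n):
    A_ub[2 * j, j], A_ub[2 * j, -1] = 1.0, -1.0
    A_ub[2 * j + 1, j], A_ub[2 * j + 1, -1] = -1.0, -1.0
b_ub = np.zeros(2 * n)

# Data-fit constraint of the toy problem: w1 + w2 + w3 = 3.
A_eq = np.array([[1.0, 1.0, 1.0, 0.0]])
b_eq = np.array([3.0])

bounds = [(None, None)] * n + [(0, None)]  # w free, a >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

w, a = res.x[:n], res.x[-1]
print(w, a)  # minimum-infinity-norm solution: w = (1, 1, 1), a = 1
```

The same idea carries over when the objective is quadratic: the infinity-norm regularizer adds only linear constraints, so the overall problem stays a convex QP, which is what allows the paper's model to remain solvable as a single quadratic program.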

Key words: clustering, feature selection, nonparallel hyperplane support vector machine, L-infinity norm
