Acta Electronica Sinica ›› 2022, Vol. 50 ›› Issue (10): 2517-2529. DOI: 10.12263/DZXB.20210265

• Academic Paper •

Mixed-Sampling Algorithm Combining Cluster Boundary Movement and Adaptive Synthesis

GAO Lei-fu1, ZHANG Meng-yao2, ZHAO Shi-jie1

  1. Institute for Optimization and Decision Analytics, Liaoning Technical University, Fuxin, Liaoning 123000, China
  2. Institute of Optimization and Decision, Liaoning Technical University, Fuxin, Liaoning 123000, China
  • Received: 2021-02-22 Revised: 2021-10-21 Online: 2022-10-25 Published: 2022-10-11
    • About the authors:
    • GAO Lei-fu, male, born in February 1963 in Fuxin, Liaoning. Ph.D., professor, and doctoral supervisor. His main research interests are optimization theory and methods, machine learning, and data analysis. E-mail: gaoleifu@163.com
      ZHANG Meng-yao, female, born in January 1996 in Hulunbuir, Inner Mongolia. Master's student. Her main research interests are machine learning and data analysis. E-mail: mengyaoz119@163.com
    • Funding:
    • Key Research Project of the Education Department of Liaoning Province (LJ2019ZL001)

Abstract:

To address the intra-class sub-clustering and class-overlap problems of the pseudo-negative sampling (PNS) algorithm, an improved mixed-sampling algorithm, ICBNMS (Improved Cluster Boundary Negative Movement Strategy), is proposed. It combines a cluster boundary negative movement strategy (CBNMS) with an adaptive positive synthesis technology (ADPST) to improve both the overall classification performance and the positive-class recognition accuracy on imbalanced data. CBNMS partitions the positive and negative samples with agglomerative hierarchical clustering (AGNES) and uses the similarity relationships among local samples to identify cluster-boundary negative samples, that is, samples in the potential negative class that are strongly correlated with the positive class, which improves the local accuracy and time efficiency of sampling. To further strengthen the ability of CBNMS to identify regions where positive samples overlap, ICBNMS first balances the data by moving the cluster-boundary negative samples and then applies ADPST, which weights a composite of sparsity and distance factors to adaptively determine the optimal region for generating samples, thereby reducing sample overlap and enriching sample diversity. Experimental results show that, compared with other sampling algorithms, ICBNMS attains the best values of G-mean, F-measure, and other metrics in multiple groups of experiments on 10 imbalanced data sets, and its time efficiency is 32.27% and 27.88% higher than that of the CDSMOTE and PNS algorithms, respectively, indicating better robustness and generalization.

Key words: imbalanced data classification, agglomerative hierarchical clustering, cluster boundary negative sample movement, adaptive positive sample synthesis, mixed sampling
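
For readers unfamiliar with the two metrics named in the abstract, the following minimal sketch computes G-mean and F-measure from their standard textbook definitions. It is background only and not code from the paper: the function names, the `positive` and `beta` parameters, and the toy labels are illustrative assumptions, and the paper may use a different variant (for example, a different beta).

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority (positive) class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the true positive rate and the true negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy imbalanced example (hypothetical labels): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(g_mean(y_true, y_pred), f_measure(y_true, y_pred))
```

On the toy labels, plain accuracy is 0.8 even though only half of the positives are found. Both metrics penalize that miss, which is why they are commonly preferred over accuracy for imbalanced data.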

CLC Number: