Acta Electronica Sinica, 2022, Vol. 50, Issue 10: 2517-2529. DOI: 10.12263/DZXB.20210265
GAO Lei-fu1, ZHANG Meng-yao2, ZHAO Shi-jie1 (高雷阜, 张梦瑶, 赵世杰)
Received: 2021-02-22
Revised: 2021-10-21
Online: 2022-10-25
Published: 2022-10-11
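Abstract:
To address the within-class sub-clustering and class-overlap problems of the Pseudo-Negative Sampling (PNS) algorithm, this paper proposes an improved mixed-sampling algorithm (Improved Cluster Boundary Negative Movement Strategy, ICBNMS) that combines a Cluster Boundary Negative Movement Strategy (CBNMS) with an Adaptive Positive Synthesis Technology (ADPST), with the aim of improving both the overall classification performance and the positive-class recognition accuracy on imbalanced data. CBNMS partitions the positive and negative samples with agglomerative hierarchical clustering and uses the similarity relations among local samples to identify cluster-boundary negative samples that lie in the potential negative class yet correlate strongly with the positive class, which improves the local accuracy and time efficiency of sampling. To further strengthen the ability of CBNMS to recognize regions where positive samples overlap, ICBNMS builds on the balancing achieved by moving cluster-boundary negative samples and introduces ADPST, which weights a composite of sparsity and distance factors to adaptively determine the best region for generating samples, thereby reducing sample overlap and enriching sample diversity. Experimental results show that, compared with other sampling algorithms, ICBNMS attains the best values of metrics such as G-mean and F-measure across multiple groups of experiments on 10 imbalanced datasets, and improves time efficiency by 32.27% and 27.88% over CDSMOTE and PNS, respectively, demonstrating stronger robustness and generalization.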
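The abstract does not spell out how ADPST weights sparsity and distance, so the following is not the authors' method. As a point of reference only, this minimal sketch shows plain SMOTE-style interpolation, the generic idea of synthesizing a positive sample on the segment between a positive sample and its nearest positive neighbor. The function names and the toy data are illustrative assumptions.

```python
import math
import random

def nearest_neighbor(point, others):
    """Return the element of `others` closest to `point` (Euclidean distance)."""
    return min(others, key=lambda q: math.dist(point, q))

def synthesize_positive(positives, n_new, seed=0):
    """Generate n_new synthetic positives by SMOTE-style interpolation.

    Assumption: plain interpolation toward the single nearest positive
    neighbor; ADPST's adaptive region selection is NOT modeled here.
    """
    if len(positives) < 2:
        raise ValueError("need at least two positive samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(positives)
        others = [p for p in positives if p is not base]
        neighbor = nearest_neighbor(base, others)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy 2-D positive class (hypothetical data)
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = synthesize_positive(positives, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two existing positives, it stays inside the region already spanned by the positive class; it does nothing by itself to avoid overlap with negatives, which is the gap the abstract says ADPST targets.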
GAO Lei-fu, ZHANG Meng-yao, ZHAO Shi-jie. Mixed-Sampling Algorithm Combining Cluster Boundary Movement and Adaptive Synthesis[J]. Acta Electronica Sinica, 2022, 50(10): 2517-2529.
Id | Dataset | #Att. | #Ex. | #Pos. | #Neg. | IR |
---|---|---|---|---|---|---|
1 | Ecoli | 7 | 336 | 35 | 301 | 8.60 |
2 | SatImage | 36 | 6 435 | 626 | 5 809 | 9.28 |
3 | Yest_ME2 | 8 | 1 484 | 51 | 1 433 | 28.10 |
4 | yeast1289vs7 | 8 | 947 | 30 | 917 | 30.56 |
5 | yeast4 | 8 | 1 484 | 51 | 1 433 | 28.41 |
6 | yeast5 | 8 | 1 484 | 44 | 1 440 | 32.78 |
7 | abalone19 | 8 | 4 174 | 32 | 4 142 | 129.44 |
8 | poker89vs5 | 9 | 3 316 | 49 | 3 267 | 66.67 |
9 | shuttle2vs5 | 10 | 2 075 | 25 | 2 050 | 82.00 |
10 | yeast6 | 8 | 1 484 | 35 | 1 449 | 41.40 |
Table 1 Information on the imbalanced datasets
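The IR column is consistent with the ratio of negative to positive samples, IR = #Neg. / #Pos. The page does not state this definition explicitly, so the minimal sketch below treats it as an assumption and checks it against three rows of Table 1. The function name `imbalance_ratio` is illustrative.

```python
def imbalance_ratio(n_pos: int, n_neg: int) -> float:
    """Return the imbalance ratio, assumed here to be #Neg / #Pos."""
    if n_pos <= 0:
        raise ValueError("n_pos must be positive")
    return n_neg / n_pos

# (#Pos., #Neg.) pairs copied from Table 1
datasets = {
    "Ecoli": (35, 301),
    "SatImage": (626, 5809),
    "abalone19": (32, 4142),
}

for name, (n_pos, n_neg) in datasets.items():
    print(f"{name}: IR = {imbalance_ratio(n_pos, n_neg):.2f}")
```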
Actual class | Predicted positive | Predicted negative |
---|---|---|
Actual positive | True Positive (TP) | False Negative (FN) |
Actual negative | False Positive (FP) | True Negative (TN) |
Table 2 Confusion matrix
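The metrics reported in the result tables can be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the page itself does not give the formulas, so treat them as assumptions. The function name `metrics_from_confusion` and the example counts are purely illustrative.

```python
import math

def metrics_from_confusion(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute common imbalanced-learning metrics from confusion-matrix counts.

    Standard definitions are assumed; a ratio with a zero denominator
    is reported as 0.0.
    """
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)   # true positive rate (recall, sensitivity)
    tnr = ratio(tn, tn + fp)   # true negative rate (specificity)
    ppv = ratio(tp, tp + fp)   # positive predictive value (precision)
    npv = ratio(tn, tn + fn)   # negative predictive value
    return {
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
        "NPV": npv,
        "G-mean": math.sqrt(tpr * tnr),
        "F-measure": ratio(2 * ppv * tpr, ppv + tpr),
    }

# Hypothetical counts, not taken from the paper
m = metrics_from_confusion(tp=40, fn=10, fp=20, tn=930)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

G-mean is the geometric mean of TPR and TNR, so it collapses toward zero when either class is largely misclassified, whereas F-measure depends only on PPV and TPR for the class treated as positive.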
Dataset | Algorithm | SVM G-mean | SVM F-measure | RF G-mean | RF F-measure | KNN G-mean | KNN F-measure |
---|---|---|---|---|---|---|---|
Ecoli | CBNMS | 0.644 | 0.957 | 0.709 | 0.960 | 0.666 | 0.956 |
Ecoli | ADPST | 0.891 | 0.959 | 0.917 | 0.963 | 0.876 | 0.955 |
Ecoli | ICBNMS | 0.960 | 0.963 | 0.963 | 0.963 | 0.935 | 0.949 |
SatImage | CBNMS | 0.724 | 0.968 | 0.723 | 0.969 | 0.803 | 0.967 |
SatImage | ADPST | 0.902 | 0.965 | 0.941 | 0.947 | 0.912 | 0.960 |
SatImage | ICBNMS | 0.948 | 0.953 | 0.944 | 0.947 | 0.936 | 0.942 |
Yest_ME2 | CBNMS | 0.543 | 0.982 | 0.435 | 0.982 | 0.382 | 0.983 |
Yest_ME2 | ADPST | 0.887 | 0.981 | 0.901 | 0.917 | 0.789 | 0.979 |
Yest_ME2 | ICBNMS | 0.931 | 0.920 | 0.900 | 0.885 | 0.898 | 0.883 |
yeast1289vs7 | CBNMS | 0.714 | 0.984 | 0.359 | 0.984 | 0.660 | 0.983 |
yeast1289vs7 | ADPST | 0.873 | 0.978 | 0.901 | 0.917 | 0.804 | 0.981 |
yeast1289vs7 | ICBNMS | 0.904 | 0.937 | 0.929 | 0.913 | 0.862 | 0.866 |
yeast4 | CBNMS | 0.643 | 0.982 | 0.447 | 0.984 | 0.411 | 0.983 |
yeast4 | ADPST | 0.786 | 0.983 | 0.763 | 0.989 | 0.859 | 0.981 |
yeast4 | ICBNMS | 0.903 | 0.922 | 0.932 | 0.888 | 0.897 | 0.884 |
yeast5 | CBNMS | 0.688 | 0.990 | 0.789 | 0.991 | 0.782 | 0.991 |
yeast5 | ADPST | 0.901 | 0.991 | 0.900 | 0.885 | 0.924 | 0.990 |
yeast5 | ICBNMS | 0.986 | 0.988 | 0.988 | 0.989 | 0.950 | 0.989 |
Table 3 Evaluation metric results of CBNMS, ADPST, and ICBNMS on different classifiers
Dataset | Metric | | | |
---|---|---|---|---|
Ecoli | AUC | 0.972 | 0.981 | 0.986 |
Ecoli | G-mean | 0.916 | 0.941 | 0.952 |
Ecoli | F-measure | 0.958 | 0.957 | 0.957 |
SatImage | AUC | 0.988 | 0.989 | 0.989 |
SatImage | G-mean | 0.924 | 0.931 | 0.934 |
SatImage | F-measure | 0.979 | 0.980 | 0.980 |
Yest_ME2 | AUC | 0.989 | 0.994 | 0.996 |
Yest_ME2 | G-mean | 0.933 | 0.963 | 0.973 |
Yest_ME2 | F-measure | 0.983 | 0.983 | 0.982 |
yeast1289vs7 | AUC | 0.972 | 0.985 | 0.992 |
yeast1289vs7 | G-mean | 0.937 | 0.958 | 0.969 |
yeast1289vs7 | F-measure | 0.982 | 0.980 | 0.979 |
yeast4 | AUC | 0.989 | 0.994 | 0.995 |
yeast4 | G-mean | 0.929 | 0.962 | 0.973 |
yeast4 | F-measure | 0.982 | 0.983 | 0.982 |
yeast5 | AUC | 0.984 | 0.981 | 0.984 |
yeast5 | G-mean | 0.779 | 0.786 | 0.754 |
yeast5 | F-measure | 0.991 | 0.991 | 0.991 |
Table 4 Metric results under different values of the sample-size adjustment coefficient ε
Dataset | Metric | | | | | |
---|---|---|---|---|---|---|
Ecoli | AUC | 0.984 | 0.983 | 0.985 | 0.984 | 0.984 |
Ecoli | G-mean | 0.948 | 0.950 | 0.951 | 0.949 | 0.949 |
Ecoli | F-measure | 0.955 | 0.956 | 0.957 | 0.955 | 0.955 |
SatImage | AUC | 0.993 | 0.993 | 0.995 | 0.994 | 0.994 |
SatImage | G-mean | 0.957 | 0.956 | 0.959±0.000 | 0.957±0.000 | 0.957±0.000 |
SatImage | F-measure | 0.970 | 0.971 | 0.974±0.000 | 0.974±0.000 | 0.970±0.000 |
Yest_ME2 | AUC | 0.995 | 0.995 | 0.996 | 0.995 | 0.996 |
Yest_ME2 | G-mean | 0.973 | 0.973 | 0.974 | 0.972 | 0.973 |
Yest_ME2 | F-measure | 0.978 | 0.982 | 0.983 | 0.981 | 0.982 |
yeast1289vs7 | AUC | 0.991 | 0.990 | 0.992 | 0.991 | 0.990 |
yeast1289vs7 | G-mean | 0.967 | 0.967 | 0.969 | 0.966 | 0.969 |
yeast1289vs7 | F-measure | 0.977 | 0.978 | 0.980 | 0.978 | 0.977 |
yeast4 | AUC | 0.989 | 0.992 | 0.996 | 0.995 | 0.994 |
yeast4 | G-mean | 0.972 | 0.972 | 0.975 | 0.972 | 0.973 |
yeast4 | F-measure | 0.979 | 0.977 | 0.983 | 0.981 | 0.982 |
yeast5 | AUC | 0.980 | 0.982 | 0.986 | 0.982 | 0.979 |
yeast5 | G-mean | 0.764 | 0.777 | 0.786 | 0.785 | 0.767 |
yeast5 | F-measure | 0.984 | 0.981 | 0.992 | 0.991 | 0.991 |
Table 5 Metric results under different combinations (α, β) of the distance and sparsity adjustment coefficients
Dataset | Algorithm | AUC | G-mean | F-measure | PPV | TPR | TNR | NPV |
---|---|---|---|---|---|---|---|---|
Ecoli | NCL | 0.928 | 0.820 | 0.944 | 0.969 | 0.921 | 0.748 | 0.538 |
Ecoli | SMOTE | 0.917 | 0.729 | 0.949 | 0.951 | 0.950 | 0.582 | 0.590 |
Ecoli | CDSMOTE | 0.960 | 0.903 | 0.686 | 0.914 | 0.852 | 0.908 | 0.431 |
Ecoli | PNS | 0.929 | 0.703 | 0.960 | 0.946 | 0.974 | 0.525 | 0.788 |
Ecoli | CBNMS | 0.929 | 0.709 | 0.960 | 0.947 | 0.974 | 0.531 | 0.734 |
Ecoli | ADPST | 0.976 | 0.917 | 0.963 | 0.969 | 0.957 | 0.882 | 0.849 |
Ecoli | ICBNMS | 0.985 | 0.942 | 0.950 | 0.968 | 0.932 | 0.952 | 0.904 |
SatImage | NCL | 0.954 | 0.799 | 0.961 | 0.963 | 0.960 | 0.666 | 0.645 |
SatImage | SMOTE | 0.960 | 0.773 | 0.959 | 0.959 | 0.976 | 0.613 | 0.740 |
SatImage | CDSMOTE | 0.961 | 0.808 | 0.965 | 0.965 | 0.966 | 0.677 | 0.687 |
SatImage | PNS | 0.956 | 0.717 | 0.968 | 0.950 | 0.988 | 0.521 | 0.827 |
SatImage | CBNMS | 0.956 | 0.723 | 0.969 | 0.951 | 0.988 | 0.531 | 0.833 |
SatImage | ADPST | 0.989 | 0.941 | 0.947 | 0.946 | 0.948 | 0.934 | 0.936 |
SatImage | ICBNMS | 0.993 | 0.957 | 0.970 | 0.956 | 0.984 | 0.975 | 0.930 |
Yest_ME2 | NCL | 0.905 | 0.558 | 0.980 | 0.976 | 0.985 | 0.329 | 0.478 |
Yest_ME2 | SMOTE | 0.909 | 0.465 | 0.904 | 0.973 | 0.990 | 0.281 | 0.516 |
Yest_ME2 | CDSMOTE | 0.916 | 0.647 | 0.973 | 0.980 | 0.967 | 0.445 | 0.335 |
Yest_ME2 | PNS | 0.895 | 0.341 | 0.983 | 0.972 | 0.994 | 0.207 | 0.655 |
Yest_ME2 | CBNMS | 0.908 | 0.435 | 0.983 | 0.972 | 0.995 | 0.217 | 0.601 |
Yest_ME2 | ADPST | 0.968 | 0.897 | 0.882 | 0.887 | 0.879 | 0.916 | 0.911 |
Yest_ME2 | ICBNMS | 0.996 | 0.973 | 0.982 | 0.972 | 0.993 | 0.954 | 0.989 |
yeast1289vs7 | NCL | 0.720 | 0.375 | 0.983 | 0.974 | 0.993 | 0.193 | 0.460 |
yeast1289vs7 | SMOTE | 0.759 | 0.327 | 0.913 | 0.974 | 0.991 | 0.200 | 0.422 |
yeast1289vs7 | CDSMOTE | 0.736 | 0.602 | 0.552 | 0.549 | 0.668 | 0.666 | 0.579 |
yeast1289vs7 | PNS | 0.719 | 0.353 | 0.984 | 0.971 | 0.996 | 0.166 | 0.643 |
yeast1289vs7 | CBNMS | 0.709 | 0.309 | 0.980 | 0.972 | 0.995 | 0.146 | 0.476 |
yeast1289vs7 | ADPST | 0.968 | 0.901 | 0.917 | 0.899 | 0.937 | 0.866 | 0.918 |
yeast1289vs7 | ICBNMS | 0.991 | 0.969 | 0.984 | 0.969 | 0.990 | 0.949 | 0.984 |
yeast4 | NCL | 0.909 | 0.553 | 0.881 | 0.972 | 0.986 | 0.341 | 0.511 |
yeast4 | SMOTE | 0.898 | 0.441 | 0.873 | 0.973 | 0.990 | 0.239 | 0.455 |
yeast4 | CDSMOTE | 0.958 | 0.794 | 0.812 | 0.842 | 0.813 | 0.822 | 0.138 |
yeast4 | PNS | 0.908 | 0.436 | 0.983 | 0.972 | 0.995 | 0.207 | 0.650 |
yeast4 | CBNMS | 0.901 | 0.447 | 0.984 | 0.972 | 0.996 | 0.215 | 0.727 |
yeast4 | ADPST | 0.967 | 0.900 | 0.885 | 0.881 | 0.890 | 0.911 | 0.918 |
yeast4 | ICBNMS | 0.996 | 0.973 | 0.982 | 0.976 | 0.992 | 0.954 | 0.987 |
yeast5 | NCL | 0.984 | 0.882 | 0.889 | 0.893 | 0.990 | 0.792 | 0.725 |
yeast5 | SMOTE | 0.977 | 0.840 | 0.887 | 0.891 | 0.992 | 0.739 | 0.750 |
yeast5 | CDSMOTE | 0.984 | 0.969 | 0.940 | 0.987 | 0.930 | 0.958 | 0.331 |
yeast5 | PNS | 0.984 | 0.791 | 0.990 | 0.985 | 0.994 | 0.644 | 0.850 |
yeast5 | CBNMS | 0.983 | 0.789 | 0.989 | 0.986 | 0.994 | 0.635 | 0.779 |
yeast5 | ADPST | 0.982 | 0.763 | 0.989 | 0.987 | 0.995 | 0.602 | 0.800 |
yeast5 | ICBNMS | 0.998 | 0.978 | 0.991 | 0.986 | 0.992 | 0.965 | 0.980 |
Table 6 Evaluation metric results of the other sampling algorithms
Dataset | Algorithm | AUC | G-mean | F-measure | PPV | TPR | TNR | NPV |
---|---|---|---|---|---|---|---|---|
abalone19 | NCL | 0.697 | 0.737 | 0.806 | 0.902 | 0.854 | 0.825 | 0.879±0.042 |
abalone19 | SMOTE | 0.723 | 0.736 | 0.863 | 0.852 | 0.876 | 0.771 | 0.390 |
abalone19 | CDSMOTE | 0.972 | 0.723 | 0.769 | 0.816 | 0.742 | 0.819 | 0.773 |
abalone19 | PNS | 0.947 | 0.877 | 0.893 | 0.972 | 0.882 | 0.872 | 0.812±0.062 |
abalone19 | CBNMS | 0.994 | 0.874 | 0.983 | 0.992 | 0.979 | 0.687 | 0.887±0.001 |
abalone19 | ADPST | 0.995 | 0.964 | 0.967 | 0.965 | 0.969 | 0.958 | 0.963 |
abalone19 | ICBNMS | 0.999 | 0.989 | 0.993 | 0.990 | 0.995 | 0.993 | 0.983 |
poker89vs5 | NCL | 0.936 | 0.811 | 0.981 | 0.969 | 0.989 | 0.683 | 0.908 |
poker89vs5 | SMOTE | 0.934 | 0.855 | 0.972 | 0.971 | 0.968 | 0.762 | 0.714 |
poker89vs5 | CDSMOTE | 0.934 | 0.839 | 0.975 | 0.867 | 0.853 | 0.847 | 0.874 |
poker89vs5 | PNS | 0.939 | 0.913 | 0.978 | 0.971 | 0.984 | 0.690 | 0.807 |
poker89vs5 | CBNMS | 0.976 | 0.966 | 0.993 | 0.972 | 0.992 | 0.979 | 0.976 |
poker89vs5 | ADPST | 0.994 | 0.952 | 0.989 | 0.984 | 0.993 | 0.978 | 0.974 |
poker89vs5 | ICBNMS | 0.998 | 0.983 | 0.993 | 0.987 | 0.999 | 0.989 | 0.983 |
shuttle2vs5 | NCL | 0.964 | 0.782 | 0.870 | 0.882 | 0.896 | 0.870 | 0.890 |
shuttle2vs5 | SMOTE | 0.887 | 0.793 | 0.868 | 0.850 | 0.888 | 0.712 | 0.783 |
shuttle2vs5 | CDSMOTE | 0.885 | 0.879 | 0.968 | 0.837 | 0.810 | 0.673 | 0.805 |
shuttle2vs5 | PNS | 0.980 | 0.899 | 0.907 | 0.895 | 0.828 | 0.879 | 0.830 |
shuttle2vs5 | CBNMS | 0.989±0.004 | 0.955±0.002 | 0.932±0.007 | 0.942±0.006 | 0.934±0.011 | 0.996±0.000 | 0.991±0.004 |
shuttle2vs5 | ADPST | 0.992±0.001 | 0.954±0.006 | 0.947±0.003 | 0.951±0.002 | 0.927±0.011 | 0.995±0.001 | 0.996±0.002 |
shuttle2vs5 | ICBNMS | 0.993±0.001 | 0.959±0.002 | 0.968±0.002 | 0.977±0.001 | 0.953±0.002 | 0.997±0.000 | 0.998±0.001 |
yeast6 | NCL | 0.913 | 0.705 | 0.790 | 0.987 | 0.878 | 0.514 | 0.552 |
yeast6 | SMOTE | 0.890 | 0.731 | 0.766 | 0.889 | 0.884 | 0.565 | 0.484 |
yeast6 | CDSMOTE | 0.873 | 0.897 | 0.890 | 0.893 | 0.837 | 0.879 | 0.743 |
yeast6 | PNS | 0.842 | 0.805 | 0.910 | 0.987 | 0.891 | 0.802 | 0.611 |
yeast6 | CBNMS | 0.885 | 0.834 | 0.991 | 0.986 | 0.996 | 0.742 | 0.726 |
yeast6 | ADPST | 0.993 | 0.956 | 0.979 | 0.972 | 0.987 | 0.927 | 0.965 |
yeast6 | ICBNMS | 0.996 | 0.985 | 0.989 | 0.988 | 0.992 | 0.987 | 0.978 |
Table 7 Evaluation metric results of the compared algorithms on datasets with high imbalance ratios
Dataset | Classifier | NCL | SMOTE | CDSMOTE | PNS | ADPST | ICBNMS |
---|---|---|---|---|---|---|---|
Ecoli | SVM | 0.05 | 0.09 | 0.37 | 0.32 | 0.09 | 0.09 |
Ecoli | RF | 0.53 | 0.65 | 1.64 | 1.44 | 0.57 | 0.54 |
Ecoli | KNN | 0.11 | 0.31 | 1.34 | 1.14 | 0.16 | 0.20 |
SatImage | SVM | 10.08 | 28.59 | 12.45 | 12.56 | 10.57 | 12.74 |
SatImage | RF | 2.06 | 7.30 | 4.64 | 4.43 | 2.13 | 3.12 |
SatImage | KNN | 24.02 | 93.95 | 31.65 | 36.01 | 31.08 | 28.36 |
Yest_ME2 | SVM | 0.17 | 0.45 | 0.42 | 0.44 | 0.17 | 0.25 |
Yest_ME2 | RF | 0.61 | 1.29 | 0.96 | 1.64 | 0.54 | 0.52 |
Yest_ME2 | KNN | 1.34 | 5.12 | 2.63 | 1.54 | 1.06 | 1.21 |
yeast1289vs7 | SVM | 0.13 | 0.30 | 0.55 | 0.42 | 0.17 | 0.15 |
yeast1289vs7 | RF | 0.44 | 0.91 | 1.88 | 1.47 | 0.56 | 0.47 |
yeast1289vs7 | KNN | 0.61 | 2.41 | 2.50 | 0.87 | 1.06 | 1.10 |
yeast4 | SVM | 0.16 | 0.46 | 0.48 | 0.36 | 0.26 | 0.24 |
yeast4 | RF | 0.51 | 1.10 | 1.04 | 1.26 | 0.58 | 0.55 |
yeast4 | KNN | 1.38 | 5.50 | 4.42 | 1.97 | 2.06 | 1.89 |
yeast5 | SVM | 0.16 | 0.41 | 0.50 | 0.47 | 0.17 | 0.31 |
yeast5 | RF | 0.61 | 0.89 | 2.96 | 2.76 | 0.59 | 0.53 |
yeast5 | KNN | 1.52 | 5.08 | 5.78 | 2.48 | 1.06 | 2.95 |
Total | | 44.49 | 154.81 | 76.21 | 71.58 | 52.88 | 55.22 |
Table 8 Running-time comparison of the compared algorithms on different classifiers
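One way to read the totals row is as a relative reduction in total running time. The sketch below uses only the totals in Table 8 and computes (t_baseline − t_ICBNMS) / t_baseline. That formula is an assumption: the page does not say how the abstract's 32.27% and 27.88% were obtained, and they may rest on a different basis (for example, all 10 datasets), so the figures printed here need not coincide with them. The function name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline time saved by the candidate (assumed formula)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline

# Totals row of Table 8 (time units as reported there)
totals = {
    "NCL": 44.49,
    "SMOTE": 154.81,
    "CDSMOTE": 76.21,
    "PNS": 71.58,
    "ADPST": 52.88,
    "ICBNMS": 55.22,
}

for name in ("SMOTE", "CDSMOTE", "PNS"):
    saved = relative_reduction(totals[name], totals["ICBNMS"])
    print(f"ICBNMS vs {name}: {saved:.2%} less total time")
```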