

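1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116
2. Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, Jiangsu 221116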
Received: 04 March 2024
Revised: 22 July 2024
Published: 25 October 2024
SUN Zhong-bin, DIAO Yu-xuan, MA Su-yang. An Imbalanced Multi-Label Data Ensemble Learning Method Based on Safe Under-Sampling[J]. Acta Electronica Sinica, 2024, 52(10): 3392-3408. DOI:10.12263/DZXB.20240210
Multi-label classification tasks are common in real-world applications, but they often suffer from imbalanced data, which seriously degrades classification performance. The mainstream remedy is resampling, which is divided into over-sampling and under-sampling: over-sampling generates samples associated with minority labels, while under-sampling removes samples associated with majority labels. However, existing methods each address only one kind of imbalance, either intra-label or inter-label imbalance, so resolving one kind may introduce the other. To address this issue, this paper proposes ESUS (Ensemble learning method based on Safe Under-Sampling), an ensemble learning method for imbalanced multi-label data. First, the imbalanced multi-label dataset is divided into single-label datasets and label-pair datasets through label partitioning. For the single-label datasets, a safe under-sampling method is proposed to resolve intra-label imbalance, and binary classification models are built on the resulting balanced datasets. For the label-pair datasets, the data are pruned and ensemble learning is then applied to resolve inter-label imbalance, which maintains classification performance while reducing time and space complexity. Finally, the single-label models and the label-pair models are combined into the final classification model. Experiments on six imbalanced multi-label datasets show that, compared with seven competing methods, ESUS is more stable and effective on four evaluation metrics.
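The label-partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_single_label`, `split_label_pairs`, `random_under_sample`) are hypothetical, the label-pair dataset is assumed here to simply keep the two label columns side by side, and plain random under-sampling stands in for the paper's safe under-sampling, whose selection rule is not given in the abstract.

```python
import random
from itertools import combinations


def split_single_label(X, Y):
    """Return one binary dataset per label: {j: (X, targets for label j)}."""
    n_labels = len(Y[0])
    return {j: (X, [row[j] for row in Y]) for j in range(n_labels)}


def split_label_pairs(X, Y):
    """Return one dataset per unordered label pair (a, b).

    Assumption: each pair dataset keeps both label columns as a tuple.
    """
    n_labels = len(Y[0])
    return {
        (a, b): (X, [(row[a], row[b]) for row in Y])
        for a, b in combinations(range(n_labels), 2)
    }


def random_under_sample(X, y, seed=0):
    """Balance a binary dataset by randomly dropping majority-class rows.

    Stand-in for safe under-sampling: rows are dropped uniformly at random.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row and an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 6 instances, 1 feature, 3 labels (label 2 has 2 positives).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
Y = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 0, 1]]

singles = split_single_label(X, Y)
pairs = split_label_pairs(X, Y)
X_bal, y_bal = random_under_sample(*singles[2])
```

In a full pipeline each balanced single-label dataset would train one binary classifier and each label-pair dataset would feed an ensemble member; how the paper prunes the pair datasets and combines the models is not specified in the abstract, so those steps are left out of the sketch.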