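1. Mine Digitization Engineering Research Center, China University of Mining and Technology, Xuzhou, Jiangsu 221116
2. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116
3. Kehua Data Co., Ltd. (科华数据股份有限公司), Shenzhen, Guangdong 518055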
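ZHANG Yan-mei (张艳梅), female, born in 1982 in Tangshan. Ph.D., associate professor, CCF professional member. Her main research interests include software analysis and testing and software defect prediction. E-mail: ymzhang@cumt.edu.cn
ZHI Sheng-lin (植胜林) (corresponding author), male, born in 1997 in Wuzhou. Bachelor's degree, software R&D engineer. His main research areas include application software development, software analysis and testing, and software defect prediction. E-mail: zhilincumt@163.com
JIANG Shu-juan (姜淑娟), female, born in 1966 in Laiyang. Ph.D., professor, doctoral supervisor, CCF professional member. Her main research areas include software analysis and testing and compiler technology. E-mail: shjjiang@cumt.edu.cn
YUAN Guan (袁冠), male, born in 1982 in Xuzhou. Ph.D., professor, doctoral supervisor, CCF senior member. His main research areas include spatio-temporal big data technology, computational intelligence, and software engineering. E-mail: yuanguan@cumt.edu.cn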
Received: 2021-07-14
Revised: 2022-08-17
Published in print: 2023-08-25
ZHANG Yan-mei, ZHI Sheng-lin, JIANG Shu-juan, et al. Influence Analysis Method of Class Imbalance on Software Defect Prediction Model Stability and Prediction Performance[J]. ACTA ELECTRONICA SINICA, 2023, 51(08): 2076-2087. DOI: 10.12263/DZXB.20210911. (in Chinese)
This paper proposes a method for analyzing how class imbalance affects the stability and prediction performance of software defect prediction models. First, undersampling is used to construct, from each original data set, a group of new data sets whose imbalance ratios are lower than that of the original. A fixed random seed is used during construction so that the data set built from a given original data set at a given imbalance ratio always contains the same instances, which reduces run-to-run randomness in the results. Second, the Matthews correlation coefficient (MCC) is taken as the performance metric of the prediction model: each new data set is fed to the model's classification algorithm for training, prediction, and evaluation, yielding the MCC value at each imbalance ratio for the current data set. A performance stability evaluation indicator is also proposed. The experimental results show that MCC is better suited than AUC for evaluating the stability of software defect prediction models under class imbalance, and that, in terms of the stability of defect prediction performance, cost-sensitive models outperform ensemble models.
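The two building blocks named in the abstract, seeded undersampling to a target imbalance ratio and the MCC metric, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `undersample` and `mcc` are hypothetical, labels are assumed to be 0/1 with 1 marking the defective (minority) class, and the imbalance ratio is assumed to mean the number of non-defective instances divided by the number of defective ones. The paper's stability indicator is not defined in this abstract, so it is not reproduced here.

```python
import math
import random


def undersample(labels, target_ratio, seed=0):
    """Return sorted indices of a subset with roughly the target imbalance ratio.

    Assumptions (not from the paper): labels are 0/1, 1 = defective (minority),
    and imbalance ratio = (#non-defective) / (#defective).
    All minority instances are kept; majority instances are randomly dropped.
    """
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Number of majority instances to keep, capped at what is available.
    keep = min(len(majority), round(target_ratio * len(minority)))
    # A dedicated generator with a fixed seed yields the same subset on every run.
    rng = random.Random(seed)
    return sorted(minority + rng.sample(majority, keep))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By the usual convention, MCC is taken as 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data set: 10 defective and 90 non-defective modules (ratio 9).
    labels = [1] * 10 + [0] * 90
    for ratio in (1.0, 3.0, 5.0):
        subset = undersample(labels, ratio, seed=42)
        print(ratio, len(subset))
```

In a full experiment, each subset would be used to train and evaluate a classifier, and the resulting MCC values across imbalance ratios would then feed whatever stability indicator is chosen.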