Abstract: When a standard support vector machine (SVM) is used to classify unbalanced data, the optimal hyperplane is often skewed and a large number of positive samples are misclassified. This paper discusses several causes of this misclassification and the corresponding countermeasures. Because the optimal hyperplane of an SVM is determined by only a small number of support vectors, a novel SVM mathematical model based on a negative boundary sample cutting strategy is constructed. However, this model achieves better recognition performance on positive samples only when the "training-cutting" step on negative samples is repeated many times, which is time-consuming. To replace this repeated step, an equivalent cutting hyperplane technique that can cut more negative samples at one time is adopted, and an unbalanced SVM algorithm that couples negative-sample cutting with asymmetric misclassification costs is proposed. To further enhance the classification ability of this algorithm on unbalanced data, an improved sine cosine algorithm (ISCA) is presented to optimize the bias constant of the cutting hyperplane. Experimental results verify, respectively, the necessity of optimizing the bias constant of the cutting hyperplane, the advanced optimization performance of ISCA, and the outstanding recognition performance of the proposed algorithm on unbalanced datasets.
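For orientation, the kind of search used to tune a single scalar such as the bias constant of the cutting hyperplane can be sketched with the standard sine cosine algorithm (SCA). This is a minimal sketch of the basic SCA update rule only, not the improved variant (ISCA) proposed here, whose modifications are not described in the abstract. The function name `sca_minimize`, its parameters, and the quadratic `toy_objective` (a hypothetical stand-in for a validation error that depends on the bias constant) are assumptions made for illustration.

```python
import math
import random


def sca_minimize(objective, lower, upper, dim=1, agents=20,
                 iterations=100, a=2.0, seed=0):
    """Basic sine cosine algorithm for box-constrained minimization."""
    rng = random.Random(seed)
    # Random initial population inside [lower, upper]^dim.
    positions = [[rng.uniform(lower, upper) for _ in range(dim)]
                 for _ in range(agents)]
    best_pos = list(min(positions, key=objective))
    best_val = objective(best_pos)
    history = [best_val]  # best objective value after each iteration

    for t in range(iterations):
        # r1 shrinks linearly from a towards 0: exploration, then exploitation.
        r1 = a - t * a / iterations
        for pos in positions:
            for j in range(dim):
                r2 = rng.uniform(0.0, 2.0 * math.pi)
                r3 = rng.uniform(0.0, 2.0)
                r4 = rng.random()
                # Switch between the sine and cosine update with equal chance.
                wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
                step = r1 * wave * abs(r3 * best_pos[j] - pos[j])
                # Clamp the moved coordinate back into the search box.
                pos[j] = min(upper, max(lower, pos[j] + step))
        # Update the destination point after all agents have moved.
        for pos in positions:
            val = objective(pos)
            if val < best_val:
                best_val, best_pos = val, list(pos)
        history.append(best_val)
    return best_pos, best_val, history


def toy_objective(x):
    # Hypothetical stand-in for an error measure as a function of the bias constant.
    return (x[0] - 0.3) ** 2


best_pos, best_val, history = sca_minimize(toy_objective, -1.0, 1.0)
print(best_pos, best_val)
```

In an actual experiment the toy objective would be replaced by a measure computed from a trained classifier, for example a validation error on the unbalanced dataset; that substitution is left open here because the abstract does not specify the fitness function.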