Evaluation of the Factors Affecting Classification Performance in Class Imbalance Problem

Authors

  • Duygu Aydin Hakli, Assist. Prof., Istanbul Arel University, Faculty of Medicine, Department of Biostatistics, Postcode 34010, Istanbul, Türkiye
  • Dincer Goksuluk, Assist. Prof., Erciyes University, Faculty of Medicine, Department of Biostatistics, Postcode 38030, Kayseri, Türkiye
  • Erdem Karabulut, Prof., Erciyes University, Faculty of Medicine, Department of Biostatistics, Postcode 06230, Ankara, Türkiye

Keywords:

classification, imbalance learning, machine learning, oversampling, undersampling

Abstract

In binary classification, when the class distribution is imbalanced, the aim is to improve the accuracy of the classification methods. In our study, both simulated and real data sets are used. The simulated data were generated with the "BinNor" package in R, which produces numerical and categorical variables jointly. Three factors that may affect classification performance were considered when planning the simulation: sample size, correlation structure, and class imbalance rate. Scenarios were created from combinations of these factors; each scenario was repeated 1000 times and evaluated with 10-fold cross-validation. CART, SVM, and RF were used to classify both the simulated and the real data sets. Before the classification methods were applied, SMOTE, SMOTEBoost, and RUSBoost were used to reduce or completely remove the class imbalance. Specificity, sensitivity, balanced accuracy, and F-measure were used as performance measures. According to the simulation results, as the imbalance rate increases from 10% to 30%, the three resampling algorithms yield similar accuracy across the classification methods, because the class distribution becomes more balanced.
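
The simulation design can be made concrete with a short sketch. The study generated mixed numerical and categorical predictors with the BinNor package; the stand-in below instead uses MASS::mvrnorm, purely to make the three design factors (sample size, correlation structure, imbalance rate) explicit. All numeric settings are illustrative and are not the values used in the study.

    ## Illustrative stand-in for one simulation scenario. The study itself used
    ## BinNor to generate binary and normal variables jointly; here MASS::mvrnorm
    ## is used only to expose the three design factors as function arguments.
    library(MASS)  # for mvrnorm()

    simulate_scenario <- function(n = 500,         # sample size (factor 1)
                                  rho = 0.3,       # common correlation (factor 2)
                                  imb_rate = 0.10) # minority-class proportion (factor 3)
    {
      p <- 5                                        # number of numeric predictors (illustrative)
      Sigma <- matrix(rho, p, p); diag(Sigma) <- 1  # exchangeable correlation structure
      X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
      n_min <- round(n * imb_rate)                  # size of the minority class
      y <- factor(c(rep("minority", n_min), rep("majority", n - n_min)))
      X[y == "minority", ] <- X[y == "minority", ] + 1  # shift minority means for some separation
      data.frame(X, class = y)
    }

    dat <- simulate_scenario(n = 500, rho = 0.3, imb_rate = 0.10)
    table(dat$class)  # 50 minority vs. 450 majority observations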
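
SMOTE-type methods create synthetic minority examples by interpolating between a minority observation and one of its nearest minority-class neighbours. The base-R sketch below illustrates only this core idea from Chawla et al.'s SMOTE; it is not the implementation used in the study, and the function name smote_sketch is introduced here for the example.

    ## Simplified sketch of the SMOTE interpolation idea (illustrative only):
    ## each synthetic example lies on the segment between a minority observation
    ## and one of its k nearest minority-class neighbours.
    smote_sketch <- function(X_min, n_new, k = 5) {
      X_min <- as.matrix(X_min)
      D     <- as.matrix(dist(X_min))           # pairwise distances within the minority class
      synth <- matrix(NA_real_, n_new, ncol(X_min))
      for (i in seq_len(n_new)) {
        a   <- sample(nrow(X_min), 1)           # pick a minority observation at random
        nn  <- order(D[a, ])[2:(k + 1)]         # its k nearest minority neighbours (skip itself)
        b   <- sample(nn, 1)                    # pick one neighbour
        gap <- runif(1)                         # random position along the segment
        synth[i, ] <- X_min[a, ] + gap * (X_min[b, ] - X_min[a, ])
      }
      colnames(synth) <- colnames(X_min)
      synth
    }

    X_min        <- dat[dat$class == "minority", 1:5]
    new_minority <- smote_sketch(X_min, n_new = 100, k = 5)  # 100 synthetic minority rows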
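
The classification step with 10-fold cross-validation can be expressed through the caret package cited below. The snippet is a minimal sketch that reuses the simulated data frame dat from the first example; in the study, rebalancing with SMOTE, SMOTEBoost, or RUSBoost precedes this step. The caret method names "rpart", "svmRadial", and "rf" correspond to CART, SVM (radial kernel), and random forest and require the rpart, kernlab, and randomForest packages, respectively.

    ## 10-fold cross-validation with caret for the three classifiers in the study.
    library(caret)

    ctrl <- trainControl(method = "cv", number = 10)

    fit_cart <- train(class ~ ., data = dat, method = "rpart",     trControl = ctrl)
    fit_svm  <- train(class ~ ., data = dat, method = "svmRadial", trControl = ctrl)
    fit_rf   <- train(class ~ ., data = dat, method = "rf",        trControl = ctrl)

    pred_rf <- predict(fit_rf, newdata = dat)  # in practice, predict on held-out data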
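
The four performance measures follow directly from the 2x2 confusion matrix when the minority class is treated as the positive class. The helper below (perf_measures, a name introduced for this example) computes them in base R.

    ## Sensitivity, specificity, balanced accuracy and F-measure from a 2x2 table.
    perf_measures <- function(truth, pred, positive = "minority") {
      tp <- sum(pred == positive & truth == positive)
      tn <- sum(pred != positive & truth != positive)
      fp <- sum(pred == positive & truth != positive)
      fn <- sum(pred != positive & truth == positive)
      sens <- tp / (tp + fn)                    # sensitivity (recall)
      spec <- tn / (tn + fp)                    # specificity
      prec <- tp / (tp + fp)                    # precision
      c(sensitivity       = sens,
        specificity       = spec,
        balanced_accuracy = (sens + spec) / 2,
        F_measure         = 2 * prec * sens / (prec + sens))
    }

    perf_measures(dat$class, pred_rf)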

References

N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

H. He and Y. Ma, Eds., Imbalanced Learning: Foundations, Algorithms, and Applications. Hoboken, NJ: Wiley-IEEE Press, pp. 13-36, 2013.

M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, “A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches,” IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, July 2012.

X. Liu, J. Wu, and Z. Zhou, “Exploratory Undersampling for Class-Imbalance Learning,” IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 2, pp. 539-550, December 2008.

N.V. Chawla, A. Lazarevic, L.O. Hall and K.W. Bowyer, “SMOTEBoost: Improving Prediction of the Minority Class in Boosting,” Proc. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 107-119, 2003, doi:10.1007/978-3-540-39804-2_12.

C. Seiffert, T.M. Khoshgoftaar, J.V. Hulse, and A. Napolitano, “RUSBoost: A Hybrid Approach to Alleviating Class Imbalance,” IEEE Trans. on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, no. 1, pp. 185-197, Jan. 2010, doi: 10.1109/TSMCA.2009.2029559.

A. Estabrooks, T. Jo, and N. Japkowicz, “A Multiple Resampling Method for Learning from Imbalanced Data Sets,” Computational Intelligence, vol. 20, no. 1, pp. 18-36, Feb. 2004, doi: 10.1111/j.0824-7935.2004.t01-1-00228.x.

C.V. KrishnaVeni and T. Sobha Rani, “On the Classification of Imbalanced Datasets,” International Journal of Computer Science and Technology, vol. 2, no. 1, pp. 145-148, Dec. 2011.

J. Zhang and I. Mani, “kNN approach to unbalanced data distributions: A case study involving information extraction,” Proc. Int. Conf. Mach. Learning, pp. 42–48, 2003.

P. Cao, D. Zhao, and O. Zaiane, “An optimized cost-sensitive SVM for imbalanced data learning,” Proc. Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 280-292, 2013.

D. Cieslak and N. Chawla, “Learning decision trees for unbalanced data,” Proc. Machine Learning and Knowledge Discovery in Databases, pp. 241–256, 2008.

N.V. Chawla, N. Japkowicz, and A. Kotcz, Eds., Proc. ICML Workshop on Learning from Imbalanced Data Sets, 2003.

C. Drummond and R.C. Holte, “C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling,” Proc. Workshop on Learning from Imbalanced Data Sets II, International Conference on Machine Learning, 2003.

N. Zhong and L. Zhou, Eds., Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference, PAKDD-99. Berlin: Springer-Verlag, 1999.

V.N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, pp. 123-167, 1999.

J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann, pp. 285-382, 2006.

I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA: Morgan Kaufmann, pp. 143-151, 2005.

H. Demirtas, A. Amatya, and B. Doganay, “BinNor: An R Package for Concurrent Generation of Binary and Normal Data,” Communications in Statistics - Simulation and Computation, vol. 43, no. 3, pp. 569-579, Jan. 2014, doi: 10.1080/03610918.2012.707725.

M. Kuhn, “Building Predictive Models in R Using the caret Package,” Journal of Statistical Software, vol. 28, no. 5, pp. 1-26, 2008, doi: 10.18637/jss.v028.i05.

T. Therneau, B. Atkinson, and B. Ripley, rpart: Recursive Partitioning and Regression Trees, R package, 2014. https://cran.r-project.org/web/packages/rpart/rpart.pdf

N. Lunardon, G. Menardi, and N. Torelli, “ROSE: A Package for Binary Imbalanced Learning,” The R Journal, vol. 6, no. 1, 2014.

Published

2023-09-17

How to Cite

Aydin Hakli, D., Goksuluk, D., & Karabulut, E. (2023). Evaluation of the Factors Affecting Classification Performance in Class Imbalance Problem. International Journal of Sciences: Basic and Applied Research (IJSBAR), 70(1), 238–253. Retrieved from https://gssrr.org/index.php/JournalOfBasicAndApplied/article/view/15999
