Abstract
In the classification framework there are problems in which the number of examples per class is not equitably distributed, commonly known as imbalanced data sets. This situation is a handicap when trying to identify the minority classes, since learning algorithms are not usually adapted to such characteristics. A usual approach to deal with imbalanced data sets is the use of a preprocessing step. In this paper we analyze the usefulness of data complexity measures for evaluating the behavior of undersampling and oversampling methods. Two classical learning methods, C4.5 and PART, are considered over a wide range of imbalanced data sets built from real data. Specifically, oversampling techniques and an evolutionary undersampling one have been selected for the study. We extract behavior patterns from the results in the data complexity space defined by the measures, coding them as intervals. Then, we derive rules from the intervals that describe both good and bad behaviors of C4.5 and PART for the different preprocessing approaches, thus obtaining a complete characterization of the data sets and of the differences between the oversampling and undersampling results.
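Among the oversampling techniques in this family, SMOTE (Chawla et al. 2002) generates synthetic minority examples by interpolating between a minority sample and one of its nearest minority-class neighbors. The following sketch illustrates the idea; the function name and parameters are illustrative, not the authors' implementation.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between a randomly chosen minority sample and one of its
    k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neigh = np.argsort(d)[1:k + 1]
        j = rng.choice(neigh)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Because each synthetic point is a convex combination of two minority samples, the new examples always lie inside the convex hull of the minority class.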
References
Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, Fernández JC, Herrera F (2009) KEEL: A software tool to assess evolutionary algorithms to data mining problems. Soft Comput 13(3):307–318
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36(3):849–851
Basu M, Ho TK (2006) Data complexity in pattern recognition (advanced information and knowledge processing). Springer-Verlag New York, Inc., Secaucus
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29
Baumgartner R, Somorjai RL (2006) Data complexity assessment in undersampled classification of high-dimensional biomedical data. Pattern Recognit Lett 12:1383–1389
Bernadó-Mansilla E, Ho TK (2005) Domain of competence of XCS classifier system in complexity measurement space. IEEE Trans Evol Comput 9(1):82–104
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159
Brazdil P, Giraud-Carrier C, Soares C, Vilalta R (2009) Metalearning: applications to data mining. Cognitive Technologies, Springer
Celebi M, Kingravi H, Uddin B, Iyatomi H, Aslandogan Y, Stoecker W, Moss R (2007) A methodological approach to the classification of dermoscopy images. Comput Med Imaging Graphics 31(6):362–373
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chawla NV, Japkowicz N, Kolcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6
Diamantini C, Potena D (2009) Bayes vector quantizer for class-imbalance problem. IEEE Trans Knowl Data Eng 21(5):638–651
Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp 155–164
Dong M, Kothari R (2003) Feature subset selection using a new definition of classificabilty. Pattern Recognit Lett 24:1215–1225
Drown DJ, Khoshgoftaar TM, Seliya N (2009) Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Trans Syst Man Cybern A 39(5):1097–1107
Eshelman LJ (1991) The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination. In: Foundations of genetic algorithms, pp 265–283
Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20(1):18–36
Fernández A, García S, del Jesus MJ, Herrera F (2008) A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets Syst 159(18):2378–2398
Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. In: ICML ’98: Proceedings of the fifteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, pp 144–151
García S, Herrera F (2009a) Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput 17(3):275–306
García S, Fernández A, Herrera F (2009b) Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Appl Soft Comput 9(4):1304–1314
García S, Cano JR, Bernadó-Mansilla E, Herrera F (2009c) Diagnose of effective evolutionary prototype selection using an overlapping measure. Int J Pattern Recognit Artif Intell 23(8):2378–2398
García V, Mollineda R, Sánchez JS (2008) On the k–NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal Appl 11(3–4):269–280
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300
Hoekstra A, Duin RP (1996) On the nonlinearity of pattern classifiers. In: ICPR ’96: Proceedings of the international conference on pattern recognition (ICPR ’96) Volume IV-Volume 7472, IEEE Computer Society, Washington, DC, pp 271–275
Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310
Kalousis A (2002) Algorithm selection via meta-learning. PhD thesis, Université de Geneve
Kilic K, Uncu O, Türksen IB (2007) Comparison of different strategies of utilizing fuzzy clustering in structure identification. Inform Sci 177(23):5153–5162
Kim SW, Oommen BJ (2009) On using prototype reduction schemes to enhance the computation of volume-based inter-class overlap measures. Pattern Recognit 42(11):2695–2704
Li Y, Member S, Dong M, Kothari R, Member S (2005) Classifiability-based omnivariate decision trees. IEEE Trans Neural Netw 16(6):1547–1560
Lu WZ, Wang D (2008) Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Sci Total Environ 395(2–3):109–116
Luengo J, Herrera F (2010) Domains of competence of fuzzy rule based classification systems with data complexity measures: A case of study using a fuzzy hybrid genetic based machine learning method. Fuzzy Sets Syst 161(1):3–19
Mazurowski M, Habas P, Zurada J, Lo J, Baker J, Tourassi G (2008) Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw 21(2–3):427–436
Mollineda RA, Sánchez JS, Sotoca JM (2005) Data characterization for effective prototype selection. In: First edition of the Iberian conference on pattern recognition and image analysis (IbPRIA 2005), Lecture Notes in Computer Science 3523, pp 27–34
Orriols-Puig A, Bernadó-Mansilla E (2008) Evolutionary rule-based systems for imbalanced data sets. Soft Comput 13(3):213–225
Peng X, King I (2008) Robust BMPM training based on second-order cone programming and its application in medical diagnosis. Neural Netw 21(2–3):450–457
Pfahringer B, Bensusan H, Giraud-Carrier CG (2000) Meta-learning by landmarking various learning algorithms. In: ICML ’00: Proceedings of the seventeenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, pp 743–750
Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo, California
Sánchez J, Mollineda R, Sotoca J (2007) An analysis of how training data complexity affects the nearest neighbor classifiers. Pattern Anal Appl 10(3):189–201
Singh S (2003) Multiresolution estimates of classification complexity. IEEE Trans Pattern Anal Mach Intell 25(12):1534–1539
Su CT, Hsiao YH (2007) An evaluation of the robustness of MTS for imbalanced data. IEEE Trans Knowl Data Eng 19(10):1321–1332
Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost–sensitive boosting for classification of imbalanced data. Pattern Recognit 40:3358–3378
Sun Y, Wong AKC, Kamel MS (2009) Classification of imbalanced data: A review. Int J Pattern Recognit Artif Intell 23(4):687–719
Tang Y, Zhang YQ, Chawla N (2009) SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern B Cybern 39(1):281–288
Williams D, Myers V, Silvious M (2009) Mine classification with imbalanced data. IEEE Geosci Remote Sens Lett 6(3):528–532
Yang Q, Wu X (2006) 10 challenging problems in data mining research. Int J Inform Tech Decis Mak 5(4):597–604
Zhou ZH, Liu XY (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng 18(1):63–77
Acknowledgments
This work has been supported by the Spanish Ministry of Education and Science under Project TIN2008-06681-C06-(01 and 02). J. Luengo holds an FPU scholarship from the Spanish Ministry of Education.
Appendices
Appendix 1: Figures with the intervals of PART and C4.5
In this appendix we depict the figures sorted by the F1, N4 and L3 data complexity measures. We use a two-column representation, so each row presents the results of C4.5 and PART for the same combination of preprocessing method and data complexity measure.
-
Figures 13, 14, 15, 16, 17 and 18 depict the results for SMOTE preprocessing.
-
Figures 19, 20, 21, 22, 23 and 24 depict the results for SMOTE-ENN preprocessing.
-
Figures 25, 26, 27, 28, 29 and 30 depict the results for EUSCHC preprocessing.
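The measures used to sort these figures come from Ho and Basu (2002). As an illustration, F1 (the maximum Fisher's discriminant ratio) can be sketched as follows for a two-class problem; the helper name is ours, not part of the paper.

```python
import numpy as np

def fisher_ratio_f1(X, y):
    """Maximum Fisher's discriminant ratio (measure F1 of Ho and Basu, 2002):
    for each feature compute (mean1 - mean2)^2 / (var1 + var2) over the two
    classes and return the maximum over all features. Higher values indicate
    at least one feature that separates the classes well."""
    X = np.asarray(X, dtype=float)
    c1, c2 = np.unique(y)
    A, B = X[y == c1], X[y == c2]
    num = (A.mean(axis=0) - B.mean(axis=0)) ** 2
    den = A.var(axis=0) + B.var(axis=0)
    return float(np.max(num / den))
```

N4 and L3 are defined analogously from the training-set nonlinearity of a nearest-neighbor and a linear classifier, respectively.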
Appendix 2: Tables of results
In this appendix we present the average AUC results for C4.5 and PART in Tables 15 and 16, respectively.
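One standard way to obtain the AUC reported in these tables is the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The sketch below is an illustrative helper, not the paper's code.

```python
def auc_from_scores(scores, labels):
    """AUC via the Mann-Whitney formulation: the fraction of
    positive/negative pairs in which the positive example is scored
    higher, counting ties as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise definition is equivalent to the area under the ROC curve used by Bradley (1997) and Huang and Ling (2005) as an evaluation measure for imbalanced problems.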
Cite this article
Luengo, J., Fernández, A., García, S. et al. Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput 15, 1909–1936 (2011). https://doi.org/10.1007/s00500-010-0625-8