research-article

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches [Research Frontier]

Authors:

Miriam Seoane Santos,

Jastin Pompeu Soares,

Pedro Henriques Abreu,

Joao Santos

IEEE Computational Intelligence Magazine, Volume 13, Issue 4

Pages 59 - 76

https://doi.org/10.1109/MCI.2018.2866730

Published: 01 November 2018

Abstract

Although cross-validation is a standard procedure for performance evaluation, how to combine it correctly with oversampling remains an open question for researchers less familiar with the imbalanced data topic. A frequent experimental flaw is applying oversampling algorithms to the entire dataset rather than to the training partitions only, which results in biased models and overly optimistic performance estimates.
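The flaw described in the abstract can be illustrated with a minimal sketch. It assumes plain random oversampling (duplicating minority examples) as a stand-in for any oversampling algorithm, and a toy label vector of 10 minority and 90 majority examples. All function names (`stratified_folds`, `random_oversample`, `leaked_test_points`) are hypothetical helpers written for this illustration, not the authors' code. The sketch only counts how many test examples also appear in the corresponding training set under each ordering of the two steps.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split positions 0..len(labels)-1 into k folds, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # deal examples round-robin
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate minority indices at random until every class has equal size."""
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


def leaked_test_points(train, test):
    """Count test entries whose original example is also in the training set."""
    train_set = set(train)
    return sum(1 for idx in test if idx in train_set)


labels = [1] * 10 + [0] * 90  # assumed toy data: 10 minority, 90 majority
k = 5

# Flawed order: oversample the entire dataset, then cross-validate.
resampled = random_oversample(range(len(labels)), labels, seed=1)
resampled_labels = [labels[i] for i in resampled]
flawed_leaks = 0
for fold in stratified_folds(resampled_labels, k, seed=1):
    fold_set = set(fold)
    test = [resampled[p] for p in fold]
    train = [resampled[p] for p in range(len(resampled)) if p not in fold_set]
    flawed_leaks += leaked_test_points(train, test)

# Sound order: split first, then oversample only the training partition.
correct_leaks = 0
for fold in stratified_folds(labels, k, seed=1):
    fold_set = set(fold)
    train = [i for i in range(len(labels)) if i not in fold_set]
    train = random_oversample(train, labels, seed=1)
    correct_leaks += leaked_test_points(train, fold)

print("leaked test points, oversampling before CV:", flawed_leaks)
print("leaked test points, oversampling inside CV:", correct_leaks)
```

In the first loop, copies of the same minority example end up on both sides of the split, so the test folds are no longer unseen data. In the second loop the test fold is set aside before any duplication, so no test example can reach the training set.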

Index Terms

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches [Research Frontier]
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications

Index terms have been assigned to the content through auto-classification.

Copyright © 2018.

Publisher

IEEE Press

Cited By

Lee H, Jeong J, Choung S (2024). Improvement of performance of in-situ virtual monitoring system of the occurrence probability for high concentrations of naturally occurring radioactive materials in groundwater through the solution of the data imbalance problem. Environmental Modelling & Software, 175:C. DOI: 10.1016/j.envsoft.2024.105978. Online publication date: 1-Apr-2024.
https://dl.acm.org/doi/10.1016/j.envsoft.2024.105978

Muhammed Niyas K, Paramasivan T (2024). Improving Alzheimer’s classification using a modified Borda count voting method on dynamic ensemble classifiers. Knowledge and Information Systems, 66:8 (4755-4787). DOI: 10.1007/s10115-024-02106-6. Online publication date: 1-Aug-2024.
https://dl.acm.org/doi/10.1007/s10115-024-02106-6

Ren J, Wang Y, Deng X (2023). Slack-Factor-Based Fuzzy Support Vector Machine for Class Imbalance Problems. ACM Transactions on Knowledge Discovery from Data, 17:6 (1-26). DOI: 10.1145/3579050. Online publication date: 1-Mar-2023.
https://dl.acm.org/doi/10.1145/3579050

Nong Y, Sharma R, Hamou-Lhadj A, Luo X, Cai H (2023). Open Science in Software Engineering: A Study on Deep Learning-Based Vulnerability Detection. IEEE Transactions on Software Engineering, 49:4 (1983-2005). DOI: 10.1109/TSE.2022.3207149. Online publication date: 1-Apr-2023.
https://dl.acm.org/doi/10.1109/TSE.2022.3207149

Zhao X, Cheng Y, Liang L, Wang H, Gao X, Wu J (2023). A balanced random learning strategy for CNN based Landsat image segmentation under imbalanced and noisy labels. Pattern Recognition, 144:C. DOI: 10.1016/j.patcog.2023.109824. Online publication date: 1-Dec-2023.
https://dl.acm.org/doi/10.1016/j.patcog.2023.109824

Islam M, Mustafa H (2023). Multi-Layer Hybrid (MLH) balancing technique. Data & Knowledge Engineering, 143:C. DOI: 10.1016/j.datak.2022.102105. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1016/j.datak.2022.102105

Hossain E, Rana R, Higgins N, Soar J, Barua P, Pisani A, Turner K (2023). Natural Language Processing in Electronic Health Records in relation to healthcare decision-making. Computers in Biology and Medicine, 155:C. DOI: 10.1016/j.compbiomed.2023.106649. Online publication date: 1-Mar-2023.
https://dl.acm.org/doi/10.1016/j.compbiomed.2023.106649

Abdelkhalek A, Mashaly M (2023). Addressing the class imbalance problem in network intrusion detection systems using data resampling and deep learning. The Journal of Supercomputing, 79:10 (10611-10644). DOI: 10.1007/s11227-023-05073-x. Online publication date: 12-Feb-2023.
https://dl.acm.org/doi/10.1007/s11227-023-05073-x

Chen Y, Pedrycz W, Yang J (2023). A new boundary-degree-based oversampling method for imbalanced data. Applied Intelligence, 53:22 (26518-26541). DOI: 10.1007/s10489-023-04846-4. Online publication date: 25-Aug-2023.
https://dl.acm.org/doi/10.1007/s10489-023-04846-4

Li J, Zhang Q, Liu L, Yao Y, Huang L, Chen F, Zhou L, Zhang B (2023). A refined zenith tropospheric delay model for Mainland China based on the global pressure and temperature 3 (GPT3) model and random forest. GPS Solutions, 27:4. DOI: 10.1007/s10291-023-01513-6. Online publication date: 15-Jul-2023.
https://dl.acm.org/doi/10.1007/s10291-023-01513-6
