research-article

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches [Research Frontier]

Published: 01 November 2018

Abstract

Although cross-validation is a standard procedure for performance evaluation, how to combine it correctly with oversampling remains an open question for researchers less familiar with imbalanced data. A frequent experimental flaw is applying oversampling algorithms to the entire dataset before partitioning it, which produces biased models and overly optimistic estimates.
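
The flaw described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental code: plain random duplication stands in for the oversampling algorithms the article studies, and every name here (`k_fold_indices`, `random_oversample`, `leaked_test_rows`, the toy `rows` data) is hypothetical. The sketch counts how many test rows share an original sample with the training set when oversampling happens before the split versus inside each training fold.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def random_oversample(rows, rng):
    """Duplicate minority rows at random until both classes are the same size.

    rows is a list of (sample_id, label) pairs with labels 0 and 1.
    """
    by_label = {0: [], 1: []}
    for row in rows:
        by_label[row[1]].append(row)
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:  # nothing to copy from, leave the data unchanged
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def leaked_test_rows(rows, k, oversample_first, seed=0):
    """Sum, over all folds, the test rows whose sample_id also occurs in training."""
    rng = random.Random(seed)
    data = random_oversample(rows, rng) if oversample_first else list(rows)
    leaked = 0
    for fold in k_fold_indices(len(data), k, rng):
        held_out = set(fold)
        test = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in held_out]
        if not oversample_first:
            # Oversample the training part only; the test fold stays untouched.
            train = random_oversample(train, rng)
        train_ids = {sample_id for sample_id, _ in train}
        leaked += sum(1 for sample_id, _ in test if sample_id in train_ids)
    return leaked


# Toy data: 96 majority samples (label 0) and 4 minority samples (label 1).
rows = [(i, 0) for i in range(96)] + [(i, 1) for i in range(96, 100)]

# Oversampling the whole dataset first: copies of a sample end up on both
# sides of the split, so the count is greater than 0.
print(leaked_test_rows(rows, k=10, oversample_first=True))

# Oversampling inside each training fold: prints 0.
print(leaked_test_rows(rows, k=10, oversample_first=False))
```

In the first call, at least one minority sample is copied 24 or more times while no fold holds more than 20 rows, so its copies necessarily straddle the train/test boundary. A model evaluated on such folds is partly tested on data it has already seen, which is one way the overly optimistic estimates arise.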

Publisher: IEEE Press
