Abstract
Current research on imbalanced data recognises that class imbalance is aggravated by other data intrinsic characteristics, among which class overlap stands out as one of the most harmful. The combination of these two problems creates a new and difficult scenario for classification tasks and has been discussed in several research works over the past two decades. In this paper, we argue that although some insightful information can be derived from related research, the joint-effect of class overlap and imbalance is still not fully understood, and we advocate for the need to move towards a unified view of the class overlap problem in imbalanced domains. To that end, we start by performing a thorough analysis of existing literature on the joint-effect of class imbalance and overlap, elaborating on important details left undiscussed in the original papers, namely the impact of data domains with different characteristics and the behaviour of classifiers with distinct learning biases. This leads to the hypothesis that class overlap comprises multiple representations, which must be accurately measured and analysed in order to provide a full characterisation of the problem. Accordingly, we devise two novel taxonomies, one for class overlap measures and the other for class overlap-based approaches, both resonating with the distinct representations of class overlap identified. This paper therefore presents a global and unique view on the joint-effect of class imbalance and overlap, from precursor work to recent developments in the field. It meticulously discusses some concepts taken as implicit in previous research, explores new perspectives in light of the limitations found, and presents new ideas that will hopefully inspire researchers to move towards a unified view of the problem and the development of suitable strategies for imbalanced and overlapped domains.
Notes
The reader may find supporting information in the supplementary material online at https://student.dei.uc.pt/~miriams/pdf-files/AIR_2021_Appendix.pdf.
The interested reader may find detailed information on the performance of each classifier in the supplementary material provided online at https://student.dei.uc.pt/~miriams/pdf-files/AIR_2021_Appendix.pdf.
Acknowledgements
This work is funded by national funds through the FCT-Foundation for Science and Technology, I.P., within the scope of the project CISUC-UID/CEC/00326/2020 and by European Social Fund, through the Regional Operational Program Centro 2020. This work is also partially supported by Andalusian frontier regional project A-TIC-434-UGR20 and by the Spanish Ministry of Science and Technology under project PID2020-119478GB-I00 including European Regional Development Funds. This work was also partially funded by the project Safe Cities-Inovação para Construir Cidades Seguras, with the reference POCI-01-0247-FEDER-041435, co-funded by the European Regional Development Fund (ERDF), through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020), under the PORTUGAL 2020 Partnership Agreement. The work is further supported by the FCT Research Grant SFRH/BD/138749/2018.
Author information
Contributions
MSS Conceptualisation, Methodology, Literature Search, Investigation, Formal Analysis, Writing—Original Draft, Writing—Review and Editing, Visualisation. PHA Conceptualisation, Validation, Writing—Review and Editing, Supervision. NJ Validation, Writing—Review and Editing. AF Validation, Writing—Review and Editing. CS Validation, Writing—Review and Editing. SW Validation, Writing—Review and Editing. JS Writing—Review and Editing.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Santos, M.S., Abreu, P.H., Japkowicz, N. et al. On the joint-effect of class imbalance and overlap: a critical review. Artif Intell Rev 55, 6207–6275 (2022). https://doi.org/10.1007/s10462-022-10150-3
DOI: https://doi.org/10.1007/s10462-022-10150-3