Abstract
Classification tasks are complicated by several factors, including skewed class proportions and unclear decision regions caused by noise, class overlap, and small disjuncts arising from large within-class variation. These issues make data classification difficult, reduce overall performance, and make it challenging to draw meaningful insights. In this research, an evidence-based adaptive oversampling algorithm (EVA-oversampling) grounded in the Dempster–Shafer theory of evidence is developed for imbalanced classification. The technique assigns each instance a probability of class membership to represent the uncertainty that the data point may carry. Synthetic data points are then generated in regions of high confidence to compensate for the under-representation of minority instances, thereby strengthening the minority class region. The experiments show that the proposed method works effectively even in situations where imbalanced counts and data complexity would normally pose significant obstacles. The approach outperforms the SMOTE, Borderline-SMOTE, ADASYN, MWMOTE, KMeansSMOTE, LoRAS, and SyMProD algorithms in terms of \(F_1\)-measure and G-mean on highly imbalanced data while maintaining overall performance.
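To make the idea of confidence-guided oversampling concrete, the following Python sketch illustrates one possible simplification: the fraction of minority points among each instance's k nearest neighbours stands in for the Dempster–Shafer mass functions described in the paper, and SMOTE-style interpolation is restricted to minority instances whose score exceeds a threshold. The function name, parameters, and scoring rule are assumptions for illustration only; this is not the authors' EVA-oversampling procedure.

```python
# Illustrative sketch only: a neighbourhood-based confidence score replaces the
# evidential mass assignment and combination used in the paper (assumption).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def evidence_guided_oversample(X, y, minority_label, k=5, conf_threshold=0.6, random_state=0):
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))  # points needed to balance classes

    # Crude "evidence" of minority membership: fraction of minority neighbours
    # among the k nearest neighbours in the full data set.
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)
    belief = (y[idx[:, 1:]] == minority_label).mean(axis=1)

    # Keep only minority instances lying in high-confidence regions.
    safe = X_min[belief >= conf_threshold]
    if len(safe) < 2 or n_needed <= 0:
        return X, y

    # SMOTE-style interpolation between high-confidence minority instances.
    nn_safe = NearestNeighbors(n_neighbors=min(k, len(safe) - 1) + 1).fit(safe)
    _, safe_idx = nn_safe.kneighbors(safe)
    base = rng.integers(0, len(safe), size=n_needed)
    mate = safe_idx[base, rng.integers(1, safe_idx.shape[1], size=n_needed)]
    gap = rng.random((n_needed, 1))
    synthetic = safe[base] + gap * (safe[mate] - safe[base])

    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new
```

In this simplified view, the confidence threshold plays the role of steering synthetic generation away from noisy or overlapping regions, which is the intuition behind oversampling only where class-membership evidence is strong.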
Data Availability
The datasets used in this study are available online at https://archive.ics.uci.edu/ml/index.php.
References
Dal Pozzolo A, Caelen O, Le Borgne Y-A, Waterschoot S, Bontempi G (2014) Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst Appl 41(10):4915–4928
Kelly D, Glavin FG, Barrett E (2022) Dowts–denial-of-wallet test simulator: synthetic data generation for preemptive defence. J Intell Inf Syst, 1–24
Zhang T, Chen J, Li F, Zhang K, Lv H, He S, Xu E (2022) Intelligent fault diagnosis of machines with small & imbalanced data: a state-of-the-art review and possible extensions. ISA Trans 119:152–171
Guo R, Liu H, Xie G, Zhang Y (2021) Weld defect detection from imbalanced radiographic images based on contrast enhancement conditional generative adversarial network and transfer learning. IEEE Sens J 21(9):10844–10853
Hammad M, Alkinani MH, Gupta B, El-Latif A, Ahmed A (2021) Myocardial infarction detection based on deep neural network on imbalanced data. Multimedia Syst, pp 1–13
Azhar NA, Pozi MSM, Din AM, Jatowt A (2022) An investigation of smote based methods for imbalanced datasets with data complexity analysis. IEEE Trans Knowl Data Eng
Santos MS, Abreu PH, Japkowicz N, Fernández A, Santos J (2023) A unifying view of class overlap and imbalance: key concepts, multi-view panorama, and open avenues for research. Information Fusion 89:228–253
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Fernández A, Garcia S, Herrera F, Chawla NV (2018) Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905
Han H, Wang W-Y, Mao B-H (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing, pp 878–887. Springer
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-Asia conference on knowledge discovery and data mining, pp 475–482. Springer
He H, Bai Y, Garcia EA, Li S (2008) Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World Congress on Computational Intelligence), pp 1322–1328
Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inf Sci 465:1–20
Zhang Y, Li X, Gao L, Wang L, Wen L (2018) Imbalanced data fault diagnosis of rotating machinery using synthetic oversampling and feature learning. J Manuf Syst 48:34–50
Wei J, Huang H, Yao L, Hu Y, Fan Q, Huang D (2020) Ia-suwo: an improving adaptive semi-unsupervised weighted oversampling for imbalanced classification problems. Knowl-Based Syst 203:106116
Napierala K, Stefanowski J (2016) Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 46(3):563–597
Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. In: Conference on artificial intelligence in medicine in Europe, pp 63–66. Springer
Onan A (2019) Consensus clustering-based undersampling approach to imbalanced learning. Sci Program 2019
Chen B, Xia S, Chen Z, Wang B, Wang G (2021) Rsmote: a self-adaptive robust smote for imbalanced problems with label noise. Inf Sci 553:397–428
Dolo KM, Mnkandla E (2022) Modifying the smote and safe-level smote oversampling method to improve performance. In: 4th International conference on wireless, intelligent and distributed environment for communication: WIDECOM 2021, pp 47–59. Springer
Barua S, Islam MM, Yao X, Murase K (2012) Mwmote-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425
Kunakorntum I, Hinthong W, Phunchongharn P (2020) A synthetic minority based on probabilistic distribution (symprod) oversampling for imbalanced datasets. IEEE Access 8:114692–114704
Abdi L, Hashemi S (2015) To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng 28(1):238–251
Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O (2021) Loras: an oversampling approach for imbalanced datasets. Mach Learn 110:279–301
Agrawal A, Viktor HL, Paquet E (2015) Scut: multi-class imbalanced data classification using smote and cluster-based undersampling. In: 2015 7th international joint conference on knowledge discovery, knowledge engineering and knowledge management (IC3k), vol 1, pp 226–234. IEEE
Alejo R, García V, Pacheco-Sánchez JH (2015) An efficient over-sampling approach based on mean square error back-propagation for dealing with the multi-class imbalance problem. Neural Process Lett 42(3):603–617
Koziarski M, Krawczyk B, Woźniak M (2017) Radial-based approach to imbalanced data oversampling. In: International conference on hybrid artificial intelligence systems, pp 318–327. Springer
Dang XT, Tran DH, Hirose O, Satou K (2015) Spy: a novel resampling method for improving classification performance in imbalanced data. In: 2015 Seventh international conference on knowledge and systems engineering (KSE), pp 280–285. IEEE
Cervantes J, Garcia-Lamont F, Rodriguez L, López A, Castilla JR, Trueba A (2017) Pso-based method for svm classification on skewed data sets. Neurocomputing 228:187–197
Dempster AP (1968) Upper and lower probabilities generated by a random closed interval. Ann Math Stat, pp 957–966
Shafer G (1976) A mathematical theory of evidence, vol 42. Princeton University Press, New Jersey
Chen L, Diao L, Sang J (2019) A novel weighted evidence combination rule based on improved entropy function with a diagnosis application. Int J Distrib Sens Netw 15(1):1550147718823990
Tong Z, Xu P, Denoeux T (2021) An evidential classifier based on Dempster–Shafer theory and deep learning. Neurocomputing 450:275–293
Grina F, Elouedi Z, Lefevre E (2021) Evidential undersampling approach for imbalanced datasets with class-overlapping and noise. In: International conference on modeling decisions for artificial intelligence, pp 181–192. Springer
Grina F, Elouedi Z, Lefevre E (2020) A preprocessing approach for class-imbalanced data using smote and belief function theory. In: Analide C, Novais P, Camacho D, Yin H (eds) Intelligent data engineering and automated learning—IDEAL 2020. Springer, Cham, pp 3–11
Grina F, Elouedi Z, Lefèvre E (2021) Uncertainty-aware resampling method for imbalanced classification using evidence theory. In: Vejnarová J, Wilson N (eds) Symbolic and quantitative approaches to reasoning with uncertainty. Springer, Cham, pp 342–353
Denoeux T (1995) A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Trans Syst Man Cybern 25(5):804–813
Xiao F, Qin B (2018) A weighted combination method for conflicting evidence in multi-sensor data fusion. Sensors 18(5)
Deng Y (2016) Deng entropy. Chaos Solitons Fract 91:549–553
Capó M, Pérez A, Lozano JA (2020) An efficient k-means clustering algorithm for tall data. Data Min Knowl Disc 34:776–811
Acknowledgements
Not applicable.
Funding
This work was supported by the Ministry of Science and Technology, Taiwan under Grant No. MOST 110-2221-E-155-060-MY2.
Author information
Authors and Affiliations
Contributions
Chen-ju Lin was involved in the conceptualization, methodology, reviewing, supervision, and funding acquisition. Florence Leony contributed to the methodology, data curation, formal analysis, visualization, investigation, writing, and editing. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, Cj., Leony, F. Evidence-based adaptive oversampling algorithm for imbalanced classification. Knowl Inf Syst 66, 2209–2233 (2024). https://doi.org/10.1007/s10115-023-01985-5