
Evidence-based adaptive oversampling algorithm for imbalanced classification

  • Regular paper
Knowledge and Information Systems

Abstract

Classification is complicated by several factors, including skewed class proportions and unclear decision regions that arise from noise, class overlap, and small disjuncts caused by large within-class variation. These issues make data difficult to classify, degrade overall performance, and obscure meaningful insights. In this research, an evidence-based adaptive oversampling algorithm (EVA-oversampling) built on the Dempster–Shafer theory of evidence is developed for imbalanced classification. The technique assigns each instance a probability of class membership to represent the uncertainty that each data point may carry. Synthetic data points are then generated in high-confidence regions to compensate for the under-representation of minority instances, thereby strengthening the minority class region. Experiments show that the proposed method remains effective even when imbalanced counts and data complexity would normally pose significant obstacles. It outperforms the SMOTE, Borderline-SMOTE, ADASYN, MWMOTE, KMeansSMOTE, LoRAS, and SyMProD algorithms in terms of \(F_1\)-measure and G-mean on highly imbalanced data while maintaining overall performance.
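The core idea in the abstract can be sketched in a few lines. The sketch below is a hedged illustration, not the authors' implementation: the Dempster–Shafer mass assignment is replaced by a crude k-nearest-neighbor agreement score standing in for per-instance confidence, and every name and parameter (`eva_like_oversample`, `k=5`, `conf_threshold=0.7`) is hypothetical.

```python
import numpy as np

def eva_like_oversample(X, y, minority_label=1, k=5, conf_threshold=0.7,
                        random_state=0):
    """Oversample only 'high-confidence' minority points (illustrative).

    The confidence score here is a simplified surrogate for the evidential
    mass assignment described in the paper: the fraction of a point's k
    nearest neighbors that share its label. All names and thresholds are
    hypothetical.
    """
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))

    # Confidence of each minority point = share of same-class neighbors.
    conf = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X - x, axis=1)
        nn = np.argsort(d)[1:k + 1]           # skip the point itself
        conf[i] = np.mean(y[nn] == minority_label)

    core = X_min[conf >= conf_threshold]      # "safe" region to reinforce
    if len(core) < 2 or n_needed <= 0:
        return X, y

    # SMOTE-style interpolation restricted to the high-confidence core.
    a = core[rng.integers(0, len(core), n_needed)]
    b = core[rng.integers(0, len(core), n_needed)]
    X_new = a + rng.random((n_needed, 1)) * (b - a)

    X_out = np.vstack([X, X_new])
    y_out = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_out, y_out
```

Restricting interpolation to the high-confidence region is what distinguishes this family of methods from plain SMOTE: synthetic points are kept away from noisy or overlapping areas, so the minority region is reinforced without amplifying label noise.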


[Figures 1–4 and Algorithms 1–2 appear in the full article.]

Data Availability

The datasets used in this study are available online at https://archive.ics.uci.edu/ml/index.php.

References

  1. Dal Pozzolo A, Caelen O, Le Borgne Y-A, Waterschoot S, Bontempi G (2014) Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst Appl 41(10):4915–4928

  2. Kelly D, Glavin FG, Barrett E (2022) DoWTS–denial-of-wallet test simulator: synthetic data generation for preemptive defence. J Intell Inf Syst, 1–24

  3. Zhang T, Chen J, Li F, Zhang K, Lv H, He S, Xu E (2022) Intelligent fault diagnosis of machines with small & imbalanced data: a state-of-the-art review and possible extensions. ISA Trans 119:152–171

  4. Guo R, Liu H, Xie G, Zhang Y (2021) Weld defect detection from imbalanced radiographic images based on contrast enhancement conditional generative adversarial network and transfer learning. IEEE Sens J 21(9):10844–10853

  5. Hammad M, Alkinani MH, Gupta B, El-Latif A, Ahmed A (2021) Myocardial infarction detection based on deep neural network on imbalanced data. Multimedia Syst, pp 1–13

  6. Azhar NA, Pozi MSM, Din AM, Jatowt A (2022) An investigation of SMOTE-based methods for imbalanced datasets with data complexity analysis. IEEE Trans Knowl Data Eng

  7. Santos MS, Abreu PH, Japkowicz N, Fernández A, Santos J (2023) A unifying view of class overlap and imbalance: key concepts, multi-view panorama, and open avenues for research. Information Fusion 89:228–253

  8. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  9. Fernández A, Garcia S, Herrera F, Chawla NV (2018) SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905

  10. Han H, Wang W-Y, Mao B-H (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing, pp 878–887. Springer

  11. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-Asia conference on knowledge discovery and data mining, pp 475–482. Springer

  12. He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World Congress on Computational Intelligence), pp 1322–1328

  13. Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci 465:1–20

  14. Zhang Y, Li X, Gao L, Wang L, Wen L (2018) Imbalanced data fault diagnosis of rotating machinery using synthetic oversampling and feature learning. J Manuf Syst 48:34–50

  15. Wei J, Huang H, Yao L, Hu Y, Fan Q, Huang D (2020) IA-SUWO: an improving adaptive semi-unsupervised weighted oversampling for imbalanced classification problems. Knowl-Based Syst 203:106116

  16. Napierala K, Stefanowski J (2016) Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 46(3):563–597

  17. Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. In: Conference on artificial intelligence in medicine in Europe, pp 63–66. Springer

  18. Onan A (2019) Consensus clustering-based undersampling approach to imbalanced learning. Sci Program 2019

  19. Chen B, Xia S, Chen Z, Wang B, Wang G (2021) RSMOTE: a self-adaptive robust SMOTE for imbalanced problems with label noise. Inf Sci 553:397–428

  20. Dolo KM, Mnkandla E (2022) Modifying the SMOTE and safe-level SMOTE oversampling method to improve performance. In: 4th International conference on wireless, intelligent and distributed environment for communication: WIDECOM 2021, pp 47–59. Springer

  21. Barua S, Islam MM, Yao X, Murase K (2012) MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425

  22. Kunakorntum I, Hinthong W, Phunchongharn P (2020) A synthetic minority based on probabilistic distribution (SyMProD) oversampling for imbalanced datasets. IEEE Access 8:114692–114704

  23. Abdi L, Hashemi S (2015) To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng 28(1):238–251

  24. Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O (2021) LoRAS: an oversampling approach for imbalanced datasets. Mach Learn 110:279–301

  25. Agrawal A, Viktor HL, Paquet E (2015) SCUT: multi-class imbalanced data classification using SMOTE and cluster-based undersampling. In: 2015 7th international joint conference on knowledge discovery, knowledge engineering and knowledge management (IC3K), vol 1, pp 226–234. IEEE

  26. Alejo R, García V, Pacheco-Sánchez JH (2015) An efficient over-sampling approach based on mean square error back-propagation for dealing with the multi-class imbalance problem. Neural Process Lett 42(3):603–617

  27. Koziarski M, Krawczyk B, Woźniak M (2017) Radial-based approach to imbalanced data oversampling. In: International conference on hybrid artificial intelligence systems, pp 318–327. Springer

  28. Dang XT, Tran DH, Hirose O, Satou K (2015) SPY: a novel resampling method for improving classification performance in imbalanced data. In: 2015 Seventh international conference on knowledge and systems engineering (KSE), pp 280–285. IEEE

  29. Cervantes J, Garcia-Lamont F, Rodriguez L, López A, Castilla JR, Trueba A (2017) PSO-based method for SVM classification on skewed data sets. Neurocomputing 228:187–197

  30. Dempster AP (1968) Upper and lower probabilities generated by a random closed interval. Ann Math Stat, pp 957–966

  31. Shafer G (1976) A mathematical theory of evidence, vol 42. Princeton University Press, New Jersey

  32. Chen L, Diao L, Sang J (2019) A novel weighted evidence combination rule based on improved entropy function with a diagnosis application. Int J Distrib Sens Netw 15(1):1550147718823990

  33. Tong Z, Xu P, Denoeux T (2021) An evidential classifier based on Dempster–Shafer theory and deep learning. Neurocomputing 450:275–293

  34. Grina F, Elouedi Z, Lefevre E (2021) Evidential undersampling approach for imbalanced datasets with class-overlapping and noise. In: International conference on modeling decisions for artificial intelligence, pp 181–192. Springer

  35. Grina F, Elouedi Z, Lefevre E (2020) A preprocessing approach for class-imbalanced data using smote and belief function theory. In: Analide C, Novais P, Camacho D, Yin H (eds) Intelligent data engineering and automated learning—IDEAL 2020. Springer, Cham, pp 3–11

  36. Grina F, Elouedi Z, Lefèvre E (2021) Uncertainty-aware resampling method for imbalanced classification using evidence theory. In: Vejnarová J, Wilson N (eds) Symbolic and quantitative approaches to reasoning with uncertainty. Springer, Cham, pp 342–353

  37. Denoeux T (1995) A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Trans Syst Man Cybern 25(5):804–813

  38. Xiao F, Qin B (2018) A weighted combination method for conflicting evidence in multi-sensor data fusion. Sensors 18(5)

  39. Deng Y (2016) Deng entropy. Chaos Solitons Fract 91:549–553

  40. Capó M, Pérez A, Lozano JA (2020) An efficient k-means clustering algorithm for tall data. Data Min Knowl Disc 34:776–811


Acknowledgements

Not applicable.

Funding

This work was supported by the Ministry of Science and Technology, Taiwan under Grant No. MOST 110-2221-E-155-060-MY2.

Author information

Authors and Affiliations

Authors

Contributions

Chen-ju Lin was involved in the conceptualization, methodology, review, supervision, and funding acquisition. Florence Leony contributed to the methodology, data curation, formal analysis, visualization, investigation, writing, and editing. All authors reviewed the manuscript.

Corresponding author

Correspondence to Florence Leony.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the authors.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lin, Cj., Leony, F. Evidence-based adaptive oversampling algorithm for imbalanced classification. Knowl Inf Syst 66, 2209–2233 (2024). https://doi.org/10.1007/s10115-023-01985-5
