Abstract
A well-known rule of thumb in unbalanced classification recommends rebalancing the classes (typically by resampling) before learning the classifier. Though this seems to work in the majority of cases, no detailed analysis exists of the impact of undersampling on the accuracy of the final classifier. This paper aims to fill this gap by proposing an integrated analysis of the two elements which have the largest impact on the effectiveness of an undersampling strategy: the increase of the variance due to the reduction of the number of samples, and the warping of the posterior distribution due to the change of prior probabilities. In particular, we propose a theoretical analysis specifying under which conditions undersampling is recommended and expected to be effective. It emerges that the impact of undersampling depends on the number of samples, the variance of the classifier, the degree of imbalance and, more specifically, on the value of the posterior probability. This makes it difficult to predict the average effectiveness of an undersampling strategy, since its benefits depend on the distribution of the testing points. Results from several synthetic and real-world unbalanced datasets support and validate our findings.
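To make the "warping of the posterior distribution" concrete, the sketch below (an illustration, not code from the paper) uses the standard class-prior-shift relation: if negatives are kept with probability beta during undersampling, a posterior p on the original data appears as p_s = p / (p + beta * (1 - p)) on the undersampled data, and can be mapped back by inverting this relation. The function names and the choice of beta are assumptions made for illustration.

```python
# Illustrative sketch (assumed names and values, not the paper's code):
# how undersampling the majority class warps the minority-class posterior.

def warped_posterior(p, beta):
    """Posterior seen on data where negatives are kept with probability beta.

    Standard class-prior-shift relation: p_s = p / (p + beta * (1 - p)).
    """
    return p / (p + beta * (1.0 - p))


def corrected_posterior(p_s, beta):
    """Invert the warping to recover the posterior under the original priors."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


if __name__ == "__main__":
    beta = 0.1  # keep only 10% of the majority (negative) class
    for p in (0.01, 0.10, 0.30, 0.50):
        p_s = warped_posterior(p, beta)
        p_back = corrected_posterior(p_s, beta)
        print(f"p = {p:.2f}  ->  warped p_s = {p_s:.3f}  ->  corrected = {p_back:.3f}")
```

Note how, in relative terms, small posteriors are inflated the most, which is consistent with the abstract's point that the effect of undersampling depends on the value of the posterior probability.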
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Dal Pozzolo, A., Caelen, O., Bontempi, G. (2015). When is Undersampling Effective in Unbalanced Classification Tasks? In: Appice, A., Rodrigues, P., Santos Costa, V., Soares, C., Gama, J., Jorge, A. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2015. Lecture Notes in Computer Science, vol 9284. Springer, Cham. https://doi.org/10.1007/978-3-319-23528-8_13
DOI: https://doi.org/10.1007/978-3-319-23528-8_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23527-1
Online ISBN: 978-3-319-23528-8
eBook Packages: Computer Science, Computer Science (R0)