DOI: 10.1145/2656434.2656442
research-article

Modeling of class imbalance using an empirical approach with spambase dataset and random forest classification

Published: 13 October 2014

Abstract

Classification of imbalanced data is an important research problem because most data encountered in real-world systems is imbalanced. Recently, a representation learning technique called the Synthetic Minority Over-sampling Technique (SMOTE) has been proposed to handle the imbalanced data problem. The Random Forest (RF) algorithm has previously been used with SMOTE to improve classification performance on the minority class relative to the majority class. Although RF with SMOTE demonstrates improved classification performance, the relationship between classification performance and the imbalance ratio between the majority and minority classes is not well defined. Mathematical models that describe this relationship are therefore useful, especially in the big data environment, which suffers from imbalanced data. In this paper, we propose a mathematical model derived using an empirical approach applied to the well-known Spambase dataset and the Random Forest classification approach, including its adoption with the SMOTE representation learning technique. We present a linear model that describes the relationship between the true positive classification rate and the imbalance ratio between the majority and minority classes. This model can help IT researchers develop better spam filter algorithms.
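
To make the oversampling step mentioned in the abstract concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the paper's implementation: the function names `smote_sketch` and `imbalance_ratio`, the parameters `k` and `seed`, and the choice to pick base samples at random are all assumptions made for illustration. The abstract does not define the imbalance ratio precisely, so the majority-to-minority count ratio used here is also an assumption.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def imbalance_ratio(n_majority, n_minority):
    """Majority-to-minority count ratio (assumed definition)."""
    return n_majority / n_minority
```

In an experiment of the kind the abstract describes, one would presumably vary the ratio, train a classifier on the original or oversampled data, and record the true positive rate at each ratio before fitting a line to the resulting pairs.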

Cited By

  • (2016) Big Data Analytics. Machine Learning Models and Algorithms for Big Data Classification, pp. 31-75. DOI: 10.1007/978-1-4899-7641-3_3. Online publication date: 2016
  • (2016) Big Data Essentials. Machine Learning Models and Algorithms for Big Data Classification, pp. 17-29. DOI: 10.1007/978-1-4899-7641-3_2. Online publication date: 2016

      Published In

      RIIT '14: Proceedings of the 3rd annual conference on Research in information technology
      October 2014
      98 pages
      ISBN:9781450327114
      DOI:10.1145/2656434

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. classification
      2. imbalanced data
      3. machine learning
      4. random forest
      5. smote

      Qualifiers

      • Research-article

      Conference

      SIGITE/RIIT'14: SIGITE/RIIT 2014
      October 15 - 18, 2014
      Atlanta, Georgia, USA

      Acceptance Rates

      RIIT '14 Paper Acceptance Rate 14 of 39 submissions, 36%;
      Overall Acceptance Rate 51 of 116 submissions, 44%
