
Modeling of class imbalance using an empirical approach with spambase dataset and random forest classification

Authors: Kiranmayi Kotipalli, Shan Suthaharan

RIIT '14: Proceedings of the 3rd annual conference on Research in information technology

Pages 75 - 80

https://doi.org/10.1145/2656434.2656442

Published: 13 October 2014

Abstract

Classification of imbalanced data is an important research problem because most data encountered in real-world systems is imbalanced. Recently, a representation learning technique called the Synthetic Minority Over-sampling Technique (SMOTE) has been proposed to handle the imbalanced data problem. The Random Forest (RF) algorithm has previously been combined with SMOTE to improve classification performance on the minority class relative to the majority class. Although RF with SMOTE demonstrates improved classification performance, the relationship between classification performance and the imbalance ratio between the majority and minority classes is not well defined. Mathematical models that describe this relationship are therefore useful, especially in big data environments, which often suffer from imbalanced data. In this paper, we propose a mathematical model using an empirical approach applied to the well-known Spambase dataset and the Random Forest classification approach, including its use with the SMOTE representation learning technique. We present a linear model that describes the relationship between the true positive classification rate and the imbalance ratio between the majority and minority classes. This model can help IT researchers develop better spam filter algorithms.
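The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the imbalance ratio is assumed here to be the majority count divided by the minority count (the abstract does not give a formula), and the positive class is assumed to be the minority class. The sketch shows the SMOTE interpolation step for one synthetic sample, the true positive rate, and an ordinary least-squares line of the kind that could relate the two measured quantities.

```python
import random


def smote_point(x, neighbor, rng):
    """Create one synthetic minority sample on the segment between a
    minority sample `x` and one of its minority-class neighbors
    (the interpolation step of SMOTE; neighbor search is omitted)."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def imbalance_ratio(labels, positive=1):
    """Assumed definition: majority count / minority count,
    with the positive class taken to be the minority."""
    minority = sum(1 for y in labels if y == positive)
    majority = len(labels) - minority
    if minority == 0:
        raise ValueError("no minority samples")
    return majority / minority


def true_positive_rate(y_true, y_pred, positive=1):
    """TPR = TP / (TP + FN), computed over the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

In an experiment of the kind the abstract describes, one would presumably train a classifier at several imbalance ratios, record the true positive rate at each, and pass the resulting pairs to `fit_line`. Any coefficients obtained this way depend on the data and classifier used and are not taken from the paper.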

Index Terms

1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
    2. Simulation theory


Published In

RIIT '14: Proceedings of the 3rd annual conference on Research in information technology

October 2014

98 pages

ISBN:9781450327114

DOI:10.1145/2656434

General Chairs: Becky Rutherfoord (Southern Polytechnic State University, USA), Lei Li (Southern Polytechnic State University, USA), Susan Van de Ven (Southern Polytechnic State University, USA)

Program Chairs: Amber Settle (DePaul University, USA), Terry Steinbach (DePaul University, USA)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SIGITE/RIIT'14

Sponsor:

SIGITE

SIGITE/RIIT'14: SIGITE/RIIT 2014

October 15 - 18, 2014

Atlanta, Georgia, USA

Acceptance Rates

RIIT '14 Paper Acceptance Rate 14 of 39 submissions, 36%;

Overall Acceptance Rate 51 of 116 submissions, 44%
