DOI: 10.1145/3341105.3373949
Cost-sensitive learning for imbalanced data streams

Published: 30 March 2020

Abstract

    The data imbalance problem hampers the classification task. In streaming environments it becomes even more challenging, as the proportion of classes can vary over time. Approaches based on misclassification costs can be used to mitigate this problem. In this paper, we present the Cost-Sensitive Adaptive Random Forest (CSARF) and compare it to the Adaptive Random Forest (ARF) and ARF with Resampling (ARFRE) on six real-world and six synthetic data sets with different class ratios. The empirical study analyzes two misclassification cost strategies for CSARF and shows that CSARF obtained statistically superior results w.r.t. average recall and average F1 when compared to ARF.
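The general idea behind cost-sensitive learning on streams can be sketched as follows. This is a minimal illustration only, assuming inverse-frequency misclassification costs combined with Poisson-based online bagging (the resampling scheme ARF builds on); it is not the authors' exact CSARF implementation, and the names `CostSensitiveWeighter` and `poisson` are hypothetical helpers introduced for this sketch.

```python
import math
import random
from collections import defaultdict


class CostSensitiveWeighter:
    """Tracks class frequencies seen so far in the stream and assigns
    higher training weights to instances of under-represented classes.
    Hypothetical helper illustrating one possible cost strategy."""

    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, label):
        # Incrementally track how often each class appears in the stream.
        self.counts[label] += 1

    def weight(self, label):
        # Misclassification cost inversely proportional to class frequency:
        # the rarer the class, the costlier it is to misclassify it.
        if self.counts[label] == 0:
            return 1.0
        majority = max(self.counts.values())
        return majority / self.counts[label]


def poisson(lam, rng):
    """Knuth's algorithm: sample k ~ Poisson(lam). In online bagging each
    base tree trains on the incoming instance k times; scaling lam by the
    class cost biases training toward the minority class."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1
```

In a stream with a 9:1 class ratio, the minority class would receive a weight of about 9, so a base learner that normally trains with `k ~ Poisson(6)` per instance (as in ARF) would instead draw `k ~ Poisson(6 * 9)` for minority-class instances under this sketch.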




    Published In

    SAC '20: Proceedings of the 35th Annual ACM Symposium on Applied Computing
    March 2020
    2348 pages
    ISBN:9781450368667
    DOI:10.1145/3341105
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. adaptive random forest
    2. cost-sensitive
    3. data stream
    4. ensemble
    5. imbalanced datasets

    Qualifiers

    • Research-article

    Funding Sources

    • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

    Conference

    SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing
    March 30 - April 3, 2020
    Brno, Czech Republic

    Acceptance Rates

    Overall acceptance rate: 1,650 of 6,669 submissions (25%)


    Article Metrics

    • Downloads (last 12 months): 54
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 12 Aug 2024

    Cited By

    • (2024) EMRIL: Ensemble Method based on ReInforcement Learning for binary classification in imbalanced drifting data streams. Neurocomputing, 605, 128259. DOI: 10.1016/j.neucom.2024.128259
    • (2024) Cost-sensitive continuous ensemble kernel learning for imbalanced data streams with concept drift. Knowledge-Based Systems, 284, 111272. DOI: 10.1016/j.knosys.2023.111272
    • (2024) Bin.INI: An ensemble approach for dynamic data streams. Expert Systems with Applications, 256, 124853. DOI: 10.1016/j.eswa.2024.124853
    • (2024) SMOClust: synthetic minority oversampling based on stream clustering for evolving data streams. Machine Learning, 113(7), 4671-4721. DOI: 10.1007/s10994-023-06420-y
    • (2023) Online harmonizing gradient descent for imbalanced data streams one-pass classification. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2468-2475. DOI: 10.24963/ijcai.2023/274
    • (2023) Online Learning From Incomplete and Imbalanced Data Streams. IEEE Transactions on Knowledge and Data Engineering, 35(10), 10650-10665. DOI: 10.1109/TKDE.2023.3250472
    • (2023) Efficient Prequential AUC-PR Computation. 2023 International Conference on Machine Learning and Applications (ICMLA), 2222-2227. DOI: 10.1109/ICMLA58977.2023.00335
    • (2023) Pro-IDD: Pareto-based ensemble for imbalanced and drifting data streams. Knowledge-Based Systems, 282, 111103. DOI: 10.1016/j.knosys.2023.111103
    • (2023) Cluster based active learning for classification of evolving streams. Evolutionary Intelligence, 17(4), 2167-2191. DOI: 10.1007/s12065-023-00879-3
    • (2023) An improved lightweight and real-time YOLOv5 network for detection of surface defects on indocalamus leaves. Journal of Real-Time Image Processing, 20(1). DOI: 10.1007/s11554-023-01281-z
