DOI: 10.1145/3341105.3373949
Cost-sensitive learning for imbalanced data streams

Published: 30 March 2020

Abstract

    The data imbalance problem hampers the classification task. In streaming environments it becomes even more challenging, as the proportion of classes can vary over time. Approaches based on misclassification costs can be used to mitigate this problem. In this paper, we present the Cost-Sensitive Adaptive Random Forest (CSARF) and compare it to the Adaptive Random Forest (ARF) and ARF with Resampling (ARFRE) on six real-world and six synthetic data sets with different class ratios. The empirical study analyzes two misclassification cost strategies for CSARF and shows that CSARF obtained statistically superior results w.r.t. average recall and average F1 when compared to ARF.
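The general idea behind cost-sensitive learning on streams can be sketched as follows. This is a minimal illustration only, assuming inverse-frequency misclassification costs combined with Poisson-based online bagging (the resampling scheme ARF builds on); it is not the authors' exact CSARF implementation, and the names `CostSensitiveWeighter` and `poisson` are hypothetical helpers introduced for this sketch.

```python
import math
import random
from collections import defaultdict


class CostSensitiveWeighter:
    """Tracks class frequencies seen so far in the stream and assigns
    higher training weights to instances of under-represented classes.
    Hypothetical helper illustrating one possible cost strategy."""

    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, label):
        # Incrementally track how often each class appears in the stream.
        self.counts[label] += 1

    def weight(self, label):
        # Misclassification cost inversely proportional to class frequency:
        # the rarer the class, the costlier it is to misclassify it.
        if self.counts[label] == 0:
            return 1.0
        majority = max(self.counts.values())
        return majority / self.counts[label]


def poisson(lam, rng):
    """Knuth's algorithm: sample k ~ Poisson(lam). In online bagging each
    base tree trains on the incoming instance k times; scaling lam by the
    class cost biases training toward the minority class."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1
```

In a stream with a 9:1 class ratio, the minority class would receive a weight of about 9, so a base learner that normally trains with `k ~ Poisson(6)` per instance (as in ARF) would instead draw `k ~ Poisson(6 * 9)` for minority-class instances under this sketch.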




    Published In

    SAC '20: Proceedings of the 35th Annual ACM Symposium on Applied Computing
    March 2020
    2348 pages
    ISBN:9781450368667
    DOI:10.1145/3341105
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. adaptive random forest
    2. cost-sensitive
    3. data stream
    4. ensemble
    5. imbalanced datasets

    Qualifiers

    • Research-article

    Funding Sources

    • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

    Conference

    SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing
    March 30 - April 3, 2020
    Brno, Czech Republic

    Acceptance Rates

    Overall acceptance rate: 1,650 of 6,669 submissions (25%)


    Article Metrics

    • Downloads (last 12 months): 54
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 12 Aug 2024

    Cited By

    • (2024) EMRIL: Ensemble Method based on ReInforcement Learning for binary classification in imbalanced drifting data streams. Neurocomputing, 605, 128259. DOI: 10.1016/j.neucom.2024.128259
    • (2024) Cost-sensitive continuous ensemble kernel learning for imbalanced data streams with concept drift. Knowledge-Based Systems, 284, 111272. DOI: 10.1016/j.knosys.2023.111272
    • (2024) Bin.INI: An ensemble approach for dynamic data streams. Expert Systems with Applications, 256, 124853. DOI: 10.1016/j.eswa.2024.124853
    • (2024) SMOClust: synthetic minority oversampling based on stream clustering for evolving data streams. Machine Learning, 113(7), 4671-4721. DOI: 10.1007/s10994-023-06420-y
    • (2023) Online harmonizing gradient descent for imbalanced data streams one-pass classification. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2468-2475. DOI: 10.24963/ijcai.2023/274
    • (2023) Online Learning From Incomplete and Imbalanced Data Streams. IEEE Transactions on Knowledge and Data Engineering, 35(10), 10650-10665. DOI: 10.1109/TKDE.2023.3250472
    • (2023) Efficient Prequential AUC-PR Computation. 2023 International Conference on Machine Learning and Applications (ICMLA), 2222-2227. DOI: 10.1109/ICMLA58977.2023.00335
    • (2023) Pro-IDD: Pareto-based ensemble for imbalanced and drifting data streams. Knowledge-Based Systems, 282, 111103. DOI: 10.1016/j.knosys.2023.111103
    • (2023) Cluster based active learning for classification of evolving streams. Evolutionary Intelligence, 17(4), 2167-2191. DOI: 10.1007/s12065-023-00879-3
    • (2023) An improved lightweight and real-time YOLOv5 network for detection of surface defects on indocalamus leaves. Journal of Real-Time Image Processing, 20(1). DOI: 10.1007/s11554-023-01281-z
