DOI: 10.1145/775047.775092
Article

Predicting rare classes: can boosting make any weak learner strong?

Published: 23 July 2002

    Abstract

    Boosting is a strong ensemble-based learning algorithm with the promise of iteratively improving the classification accuracy using any base learner, as long as it satisfies the condition of yielding weighted accuracy > 0.5. In this paper, we analyze boosting with respect to this basic condition on the base learner, to see if boosting ensures prediction of rarely occurring events with high recall and precision. First we show that a base learner can satisfy the required condition even for poor recall or precision levels, especially for very rare classes. Furthermore, we show that the intelligent weight updating mechanism in boosting, even in its strong cost-sensitive form, does not prevent cases where the base learner always achieves high precision but poor recall or high recall but poor precision, when mapped to the original distribution. In either of these cases, we show that the voting mechanism of boosting fails to achieve good overall recall and precision for the ensemble. In effect, our analysis indicates that one cannot be blind to the base learner performance, and just rely on the boosting mechanism to take care of its weakness. We validate our arguments empirically on a variety of real and synthetic rare class problems. In particular, using AdaCost as the boosting algorithm, and variations of PNrule and RIPPER as the base learners, we show that if algorithm A achieves better recall-precision balance than algorithm B, then using A as the base learner in AdaCost yields significantly better performance than using B as the base learner.
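
The first claim in the abstract can be illustrated with a small arithmetic sketch. It is not taken from the paper: the synthetic data (1,000 examples with a 1% rare class), the two toy predictors, the uniform weights (as in a first boosting round), and the helper name `weighted_metrics` are all assumptions made for illustration. It shows that both predictors clear the weighted accuracy > 0.5 condition while having poor recall and precision on the rare class.

```python
def weighted_metrics(y_true, y_pred, weights):
    """Return (weighted accuracy, recall, precision) for the rare class (label 1)."""
    total = sum(weights)
    # Weighted accuracy: share of the total weight on correctly classified examples.
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Recall and precision are reported as 0.0 when their denominator is zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return correct / total, recall, precision


# Hypothetical data: 10 rare-class examples followed by 990 majority-class examples.
y_true = [1] * 10 + [0] * 990
weights = [1.0] * len(y_true)  # uniform weights (assumption: first boosting round)

# Toy predictor 1: never predicts the rare class.
pred_never = [0] * 1000
# Toy predictor 2: finds 2 of the 10 rare examples and raises 8 false alarms.
pred_poor = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 982

acc_never, rec_never, prec_never = weighted_metrics(y_true, pred_never, weights)
acc_poor, rec_poor, prec_poor = weighted_metrics(y_true, pred_poor, weights)

print(acc_never, rec_never, prec_never)  # 0.99 0.0 0.0
print(acc_poor, rec_poor, prec_poor)     # 0.984 0.2 0.2
```

Under these assumptions, both predictors satisfy the base-learner condition by a wide margin, yet neither is useful for the rare class. This covers only the uniform-weight case; it does not model how the weights change in later boosting rounds.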

    Published In

    KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
    July 2002
    719 pages
    ISBN: 158113567X
    DOI: 10.1145/775047
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Article

    Conference

    KDD02

    Acceptance Rates

    KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2024) Application of a novel nested ensemble algorithm in predicting motor function recovery in patients with traumatic cervical spinal cord injury. Scientific Reports, 14:1. DOI: 10.1038/s41598-024-65755-1. Online publication date: 29-Jul-2024
    • (2024) TOPOMA: Time-Series Orthogonal Projection Operator with Moving Average for Interpretable and Training-Free Anomaly Detection. Advances in Knowledge Discovery and Data Mining, pages 53-65. DOI: 10.1007/978-981-97-2242-6_5. Online publication date: 25-Apr-2024
    • (2023) Anomaly Detection with Decision Trees for AI Assisted Evaluation of Signal Integrity on PCB Transmission Lines. Advances in Radio Science, 21, pages 37-48. DOI: 10.5194/ars-21-37-2023. Online publication date: 1-Dec-2023
    • (2023) Oil and Gas Reservoir Characterization in Predicting Missing Velocity Logs Using Machine Learning. 2023 Global Conference on Information Technologies and Communications (GCITC), pages 1-5. DOI: 10.1109/GCITC60406.2023.10426582. Online publication date: 1-Dec-2023
    • (2022) A Heuristic Approach Using Template Miners for Error Prediction in Telecommunication Networks. IEEE Access, 10, pages 118953-118964. DOI: 10.1109/ACCESS.2022.3221427. Online publication date: 2022
    • (2022) Diagnosis of PV Array Faults Using RUSBoost. Journal of Control, Automation and Electrical Systems, 34:1, pages 157-165. DOI: 10.1007/s40313-022-00947-6. Online publication date: 14-Sep-2022
    • (2022) Tools for Activating Data Marketplace (1). Tools for Activating Data Marketplace, pages 55-83. DOI: 10.1007/978-3-031-06145-5_3. Online publication date: 1-Dec-2022
    • (2021) Using Machine Learning to Predict Five-Year Reintervention Risk in Type B Aortic Dissection Patients After Thoracic Endovascular Aortic Repair. Journal of Medical Imaging and Health Informatics, 11:6, pages 1560-1567. DOI: 10.1166/jmihi.2021.3813. Online publication date: 1-Jun-2021
    • (2021) Segmentation and sampling method for complex polyline generalization based on a generative adversarial network. Geocarto International, 37:14, pages 4158-4180. DOI: 10.1080/10106049.2021.1878288. Online publication date: 9-Feb-2021
    • (2021) Towards Making More Reliable Cardiotocogram Data Prediction with Limited Expert Knowledge: Exploiting Unlabeled Data with Semi-supervised Boosting Method. Data Mining and Big Data, pages 422-435. DOI: 10.1007/978-981-16-7476-1_37. Online publication date: 31-Oct-2021
