Predicting rare classes: can boosting make any weak learner strong?

Authors: Mahesh V. Joshi, Ramesh C. Agarwal, Vipin Kumar

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 297 - 306

https://doi.org/10.1145/775047.775092

Published: 23 July 2002

Abstract

Boosting is a strong ensemble-based learning algorithm with the promise of iteratively improving the classification accuracy using any base learner, as long as it satisfies the condition of yielding weighted accuracy > 0.5. In this paper, we analyze boosting with respect to this basic condition on the base learner, to see if boosting ensures prediction of rarely occurring events with high recall and precision. First we show that a base learner can satisfy the required condition even for poor recall or precision levels, especially for very rare classes. Furthermore, we show that the intelligent weight updating mechanism in boosting, even in its strong cost-sensitive form, does not prevent cases where the base learner always achieves high precision but poor recall or high recall but poor precision, when mapped to the original distribution. In either of these cases, we show that the voting mechanism of boosting fails to achieve good overall recall and precision for the ensemble. In effect, our analysis indicates that one cannot be blind to the base learner performance, and just rely on the boosting mechanism to take care of its weakness. We validate our arguments empirically on a variety of real and synthetic rare class problems. In particular, using AdaCost as the boosting algorithm, and variations of PNrule and RIPPER as the base learners, we show that if algorithm A achieves better recall-precision balance than algorithm B, then using A as the base learner in AdaCost yields significantly better performance than using B as the base learner.
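The first claim in the abstract, that a base learner can clear the 0.5 accuracy condition while still having poor recall or poor precision on a rare class, can be illustrated with a small arithmetic sketch. The confusion-matrix counts below are hypothetical and are not taken from the paper. The sketch also assumes uniform example weights (as in a first boosting round), so that weighted accuracy equals plain accuracy; it does not model the later reweighting rounds or AdaCost's cost adjustments.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, precision) for the rare (positive) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# Hypothetical data set: 10,000 examples, of which 100 (1%) are rare-class.

# Learner with high precision but poor recall:
# finds only 10 of the 100 rare examples, with no false alarms.
low_recall = confusion_metrics(tp=10, fp=0, fn=90, tn=9900)

# Learner with high recall but poor precision:
# finds 90 of the 100 rare examples, but raises 2,000 false alarms.
low_precision = confusion_metrics(tp=90, fp=2000, fn=10, tn=7900)

for name, (acc, rec, prec) in [("low recall", low_recall),
                               ("low precision", low_precision)]:
    print(f"{name}: accuracy={acc:.3f} recall={rec:.3f} precision={prec:.3f}")
```

Under these assumed counts both learners have accuracy well above 0.5 (0.991 and 0.799), yet one recovers only 10% of the rare class and the other is correct on fewer than 5% of its rare-class predictions.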

Recommendations

Predicting Rare Classes: Comparing Two-Phase Rule Induction to Cost-Sensitive Boosting
PKDD '02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery

Learning good classifier models of rare events is a challenging task. On such problems, the recently proposed two-phase rule induction algorithm, PNrule, outperforms other non-meta methods of rule induction. Boosting is a strong meta-classifier approach,...
Learning classifier models for predicting rare phenomena
Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements
ICDM '01: Proceedings of the 2001 IEEE International Conference on Data Mining

Classification of rare events has many important data mining applications. Boosting is a promising meta-technique that improves the classification performance of any weak classifier. So far, no systematic study has been conducted to evaluate how boosting ...

Published In

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

July 2002

719 pages

ISBN:158113567X

DOI:10.1145/775047

Conference Chair: Osmar R. Zaïane (University of Alberta, Canada)

General Chair: Randy Goebel (University of Alberta, Canada)

Program Chairs: David Hand (Imperial College, UK), Daniel Keim (AT&T), Raymond Ng (University of British Columbia, Canada)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD02: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

July 23 - 26, 2002

Edmonton, Alberta, Canada

Acceptance Rates

KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

Total Citations: 59

Total Downloads: 628

Downloads (Last 12 months): 26

Downloads (Last 6 weeks): 2

Reflects downloads up to 26 Jul 2024
