DOI: 10.1145/375663.375673

Mining needle in a haystack: classifying rare classes via two-phase rule induction

Published: 01 May 2001

Abstract

Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, and deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes. The key feature of our method is that it pursues the objectives of high recall and high precision for the given target class separately. The first phase aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve precision by learning rules that remove false positives from the collection of records covered by the first-phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. We claim that such an approach is inadequate for rare classes because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we construct various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5rules, either fail to learn a model or learn a very poor one. In all these situations, our two-phase approach learns a model with significantly better recall and precision. We also compare the three methods on a challenging real-life network intrusion detection dataset, where our method is significantly better than, or comparable to, the best competitor in balancing recall and precision.
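The two-phase idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes categorical records, single-condition rules of the form attribute == value, and greedy sequential covering. The attribute names, the accuracy and support thresholds, and all function names are hypothetical choices made only for this illustration.

```python
from collections import Counter


def best_condition(records, labels, positive, min_accuracy, min_support):
    """Pick the (attribute, value) test that covers the most `positive`
    records, among tests meeting the accuracy and support thresholds."""
    covered = Counter()  # records matched by each test
    hits = Counter()     # matched records labelled `positive`
    for rec, lab in zip(records, labels):
        for test in rec.items():
            covered[test] += 1
            if lab == positive:
                hits[test] += 1
    candidates = [
        (hits[t], hits[t] / covered[t], t)
        for t in covered
        if hits[t] >= min_support and hits[t] / covered[t] >= min_accuracy
    ]
    if not candidates:
        return None
    # Prefer higher support, then higher accuracy.
    return max(candidates, key=lambda c: c[:2])[2]


def learn_rules(records, labels, positive, min_accuracy, min_support):
    """Sequential covering: add the best test, drop what it covers, repeat."""
    rules = []
    remaining = list(zip(records, labels))
    while any(lab == positive for _, lab in remaining):
        test = best_condition(
            [r for r, _ in remaining], [l for _, l in remaining],
            positive, min_accuracy, min_support,
        )
        if test is None:
            break
        rules.append(test)
        attr, value = test
        remaining = [(r, l) for r, l in remaining if r.get(attr) != value]
    return rules


def covers(rules, record):
    """True if any single-condition rule matches the record."""
    return any(record.get(attr) == value for attr, value in rules)


def fit_two_phase(records, labels, target):
    # Phase 1: aim for recall, accepting only moderately accurate rules.
    p_rules = learn_rules(records, labels, target,
                          min_accuracy=0.5, min_support=2)
    # Phase 2: among records covered by phase 1, learn rules that
    # identify the false positives so they can be removed.
    pool = [(r, l) for r, l in zip(records, labels) if covers(p_rules, r)]
    fp_labels = ["fp" if l != target else "tp" for _, l in pool]
    n_rules = learn_rules([r for r, _ in pool], fp_labels, "fp",
                          min_accuracy=0.9, min_support=1)
    return p_rules, n_rules


def predict(p_rules, n_rules, record):
    """Target class iff a phase-1 rule fires and no phase-2 rule fires."""
    return covers(p_rules, record) and not covers(n_rules, record)


# Toy data with hypothetical attributes: three rare "attack" records.
records = [
    {"service": "ftp", "logged_in": "no"},
    {"service": "ftp", "logged_in": "no"},
    {"service": "ftp", "logged_in": "no"},
    {"service": "ftp", "logged_in": "yes"},
    {"service": "http", "logged_in": "yes"},
    {"service": "http", "logged_in": "no"},
    {"service": "http", "logged_in": "yes"},
    {"service": "http", "logged_in": "no"},
]
labels = ["attack"] * 3 + ["normal"] * 5

p_rules, n_rules = fit_two_phase(records, labels, "attack")
predictions = [predict(p_rules, n_rules, r) for r in records]
```

On this toy data the first phase keeps a rule that covers every attack record along with one normal record, and the second phase learns a rule that removes that false positive. A precision-first learner would instead have to reject or specialize the first rule immediately, which is the behaviour the abstract argues against for rare classes.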

Published In

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001
630 pages
ISBN:1581133324
DOI:10.1145/375663
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS01

Acceptance Rates

SIGMOD '01 Paper Acceptance Rate: 44 of 293 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%

Cited By

  • (2023) Outlier Detection on Data Streams Using a QLattice-based Model and Online Learning. Signal and Data Processing 20:2 (81-98). DOI: 10.61186/jsdp.20.2.81. Online publication date: 1-Sep-2023
  • (2022) A Heuristic Approach Using Template Miners for Error Prediction in Telecommunication Networks. IEEE Access 10 (118953-118964). DOI: 10.1109/ACCESS.2022.3221427. Online publication date: 2022
  • (2020) A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Data and Cognitive Computing 5:1 (1). DOI: 10.3390/bdcc5010001. Online publication date: 29-Dec-2020
  • (2020) LERI: Local Exploration for Rare-Category Identification. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2019.2911941. Online publication date: 2020
  • (2019) Logistic regression for imbalanced learning based on clustering. International Journal of Computational Science and Engineering 18:1 (54-64). DOI: 10.5555/3302674.3302681. Online publication date: 1-Jan-2019
  • (2018) Adaptable multi-phase rules over the infrequent class. Soft Computing - A Fusion of Foundations, Methodologies and Applications 22:18 (6067-6076). DOI: 10.1007/s00500-018-3399-z. Online publication date: 1-Sep-2018
  • (2016) Elastic Multi-stage Decision Rules for Infrequent Class. 2016 3rd International Conference on Soft Computing & Machine Intelligence (ISCMI) (110-114). DOI: 10.1109/ISCMI.2016.20. Online publication date: Nov-2016
  • (2015) Dealing with the evaluation of supervised classification algorithms. Artificial Intelligence Review 44:4 (467-508). DOI: 10.1007/s10462-015-9433-y. Online publication date: 1-Dec-2015
  • (2014) Enhancing Intrusion Detection Systems Using Intelligent False Alarm Filter. Architectures and Protocols for Secure Information Technology Infrastructures (214-236). DOI: 10.4018/978-1-4666-4514-1.ch008. Online publication date: 2014
  • (2014) Network Anomaly Detection: Methods, Systems and Tools. IEEE Communications Surveys & Tutorials 16:1 (303-336). DOI: 10.1109/SURV.2013.052213.00046. Online publication date: Sep-2015
