Article

Mining needle in a haystack: classifying rare classes via two-phase rule induction

Authors: Mahesh V. Joshi, Ramesh C. Agarwal, Vipin Kumar

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Pages 91 - 102

https://doi.org/10.1145/375663.375673

Published: 01 May 2001

Abstract

Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, and deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes. The key feature of our method is that it separately conquers the objectives of achieving high recall and high precision for the given target class. The first phase of the method aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve precision by learning rules that remove false positives from the collection of records covered by the first-phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. In this paper, we claim that such an approach is inadequate for rare classes because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we design various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5 rules, either fail to learn a model or learn a very poor model. In all these situations, our two-phase approach learns a model with significantly better recall and precision levels. We also present a comparison of the three methods on a challenging real-life network intrusion detection dataset. Our method is significantly better than or comparable to the best competitor in terms of achieving a better balance between recall and precision.
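
The two phases described in the abstract can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation: rules here are single attribute-value conditions, and the names `learn_rules`, `fit_two_phase`, `predict`, the thresholds, and the toy records are all assumptions made for illustration. The first phase greedily picks high-support rules with only modest accuracy, and the second phase learns rules that flag false positives among the records those rules cover.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Rule = Tuple[str, str]  # (attribute, value): a single-condition rule (a simplification)


def covers(rule: Rule, rec: Record) -> bool:
    attr, value = rule
    return rec.get(attr) == value


def learn_rules(records: List[Record], targets: List[bool],
                min_accuracy: float, min_coverage: float) -> List[Rule]:
    """Greedy sequential covering: repeatedly add the rule that covers the most
    remaining target records while meeting min_accuracy, until the fraction of
    target records covered reaches min_coverage or no rule qualifies."""
    total = sum(targets)
    remaining = list(range(len(records)))
    rules: List[Rule] = []
    covered = 0
    while total and covered / total < min_coverage:
        best, best_support = None, 0
        candidates = sorted({(a, v) for i in remaining for a, v in records[i].items()})
        for rule in candidates:
            hit = [i for i in remaining if covers(rule, records[i])]
            support = sum(targets[i] for i in hit)
            if support > best_support and support / len(hit) >= min_accuracy:
                best, best_support = rule, support
        if best is None:
            break
        rules.append(best)
        covered += best_support
        remaining = [i for i in remaining if not covers(best, records[i])]
    return rules


def fit_two_phase(records: List[Record], labels: List[bool]):
    # Phase 1: favor recall. High support, only a modest accuracy requirement.
    p_rules = learn_rules(records, labels, min_accuracy=0.5, min_coverage=0.95)
    # Phase 2: favor precision. Among records covered by phase-1 rules,
    # learn rules whose target is "this covered record is a false positive".
    idx = [i for i, r in enumerate(records) if any(covers(p, r) for p in p_rules)]
    sub = [records[i] for i in idx]
    is_false_positive = [not labels[i] for i in idx]
    n_rules = learn_rules(sub, is_false_positive, min_accuracy=0.9, min_coverage=1.0)
    return p_rules, n_rules


def predict(model, rec: Record) -> bool:
    p_rules, n_rules = model
    return (any(covers(p, rec) for p in p_rules)
            and not any(covers(n, rec) for n in n_rules))


# Hypothetical toy data: three rare "attack" records among seven connections.
records = [
    {"proto": "tcp", "flag": "S0", "svc": "http"},
    {"proto": "tcp", "flag": "S0", "svc": "http"},
    {"proto": "tcp", "flag": "S0", "svc": "ftp"},
    {"proto": "tcp", "flag": "SF", "svc": "http"},
    {"proto": "udp", "flag": "SF", "svc": "http"},
    {"proto": "udp", "flag": "SF", "svc": "ftp"},
    {"proto": "tcp", "flag": "S0", "svc": "http"},
]
labels = [True, True, False, False, False, False, True]

model = fit_two_phase(records, labels)
predictions = [predict(model, r) for r in records]
print(model, predictions)
```

On this toy data the first phase selects the rule `flag = S0`, which covers all three positives plus one negative, and the second phase selects `svc = ftp` to remove that false positive. The method analyzed in the paper is richer than this sketch; the single-condition rules and fixed thresholds are only meant to make the separation of the recall and precision objectives concrete.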

Index Terms

Mining needle in a haystack: classifying rare classes via two-phase rule induction
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
  2. Information systems applications
    1. Data mining

Information

Published In

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data

May 2001

630 pages

ISBN:1581133324

DOI:10.1145/375663

ACM SIGMOD Record Volume 30, Issue 2
June 2001
625 pages
ISSN:0163-5808
DOI:10.1145/376284
Editors: Timos Sellis (National Technical Univ. of Athens), Sharad Mehrotra (Univ. of California at Irvine)

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS01

Sponsor:

SIGMOD

SIGMOD/PODS01: ACM SIGMOD International Conference on Management of Data

May 21 - 24, 2001

Santa Barbara, California, USA

Acceptance Rates

SIGMOD '01 Paper Acceptance Rate 44 of 293 submissions, 15%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 79
Total Downloads: 1,170
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 1

Reflects downloads up to 22 Sep 2024

Cited By

Fardin S, Hashemzadeh M (2023). Outlier Detection on Data Streams Using a QLattice-based Model and Online Learning. Signal and Data Processing 20:2 (81-98). Online publication date: 1-Sep-2023.
https://doi.org/10.61186/jsdp.20.2.81
Marjai P, Lehotay-Kery P, Kiss A (2022). A Heuristic Approach Using Template Miners for Error Prediction in Telecommunication Networks. IEEE Access 10 (118953-118964). Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3221427
Alghushairy O, Alsini R, Soule T, Ma X (2020). A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Data and Cognitive Computing 5:1 (1). Online publication date: 29-Dec-2020.
https://doi.org/10.3390/bdcc5010001
Huang H, Yan Q, Lu W, Lin H, Gao Y, Chen L (2020). LERI: Local Exploration for Rare-Category Identification. IEEE Transactions on Knowledge and Data Engineering (1-1). Online publication date: 2020.
https://doi.org/10.1109/TKDE.2019.2911941
(2019). Logistic regression for imbalanced learning based on clustering. International Journal of Computational Science and Engineering 18:1 (54-64). Online publication date: 1-Jan-2019.
https://dl.acm.org/doi/10.5555/3302674.3302681
Datta S, Mengel S (2018). Adaptable multi-phase rules over the infrequent class. Soft Computing - A Fusion of Foundations, Methodologies and Applications 22:18 (6067-6076). Online publication date: 1-Sep-2018.
https://dl.acm.org/doi/10.1007/s00500-018-3399-z
Datta S, Mengel S (2016). Elastic Multi-stage Decision Rules for Infrequent Class. 2016 3rd International Conference on Soft Computing & Machine Intelligence (ISCMI) (110-114). Online publication date: Nov-2016.
https://doi.org/10.1109/ISCMI.2016.20
Santafe G, Inza I, Lozano J (2015). Dealing with the evaluation of supervised classification algorithms. Artificial Intelligence Review 44:4 (467-508). Online publication date: 1-Dec-2015.
https://dl.acm.org/doi/10.1007/s10462-015-9433-y
Meng Y, Kwok L (2014). Enhancing Intrusion Detection Systems Using Intelligent False Alarm Filter. Architectures and Protocols for Secure Information Technology Infrastructures (214-236). Online publication date: 2014.
https://doi.org/10.4018/978-1-4666-4514-1.ch008
Bhuyan M, Bhattacharyya D, Kalita J (2014). Network Anomaly Detection: Methods, Systems and Tools. IEEE Communications Surveys & Tutorials 16:1 (303-336). Online publication date: Sep-2015.
https://doi.org/10.1109/SURV.2013.052213.00046
