Article

Discovering outlier filtering rules from unlabeled data: combining a supervised learner with an unsupervised learner

Authors:

Kenji Yamanishi,

Jun-ichi Takeuchi

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 389 - 394

https://doi.org/10.1145/502512.502570

Published: 26 August 2001

Abstract

This paper is concerned with the problem of detecting outliers from unlabeled data. In prior work we developed SmartSifter, an on-line outlier detection algorithm based on unsupervised learning from data. Building on SmartSifter, this paper presents a new framework for outlier filtering that uses supervised and unsupervised learning techniques iteratively in order to make the detection process more effective and more understandable. The outline of the framework is as follows. In the first round, for an initial dataset, we run SmartSifter to give each datum a score, with a high score indicating a high possibility of being an outlier. Next, we create labeled examples by giving positive labels to a number of higher-scored data and negative labels to a number of lower-scored data. Then we construct an outlier filtering rule by supervised learning from these examples. Here the rule is generated based on the principle of minimizing extended stochastic complexity. In the second round, for a new dataset, we filter the data using the constructed rule; then, among the filtered data, we run SmartSifter again to evaluate the data in order to update the filtering rule. Applying our framework to network intrusion detection, we demonstrate that 1) it can significantly improve the accuracy of SmartSifter, and 2) outlier filtering rules can help the user to discover a general pattern of an outlier group.
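The two-round loop outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_outliers` is a hypothetical stand-in for SmartSifter (a simple per-feature distance from the mean), `learn_rule` is a hypothetical stand-in for the ESC-based rule learner (a single-feature threshold rule chosen by training error), and reading "the filtered data" in round two as the records the current rule did not flag is also an assumption.

```python
import statistics


def score_outliers(data):
    """Stand-in scorer (NOT SmartSifter): sum of per-feature |x - mean| / std."""
    cols = list(zip(*data))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    return [sum(abs(x - m) / s for x, m, s in zip(row, means, stds)) for row in data]


def make_labeled_examples(data, scores, n_pos, n_neg):
    """Label the n_pos highest-scored rows 1 and the n_neg lowest-scored rows 0."""
    order = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    positives = [(data[i], 1) for i in order[:n_pos]]
    negatives = [(data[i], 0) for i in order[len(order) - n_neg:]]
    return positives + negatives


def learn_rule(examples):
    """Stand-in learner (NOT ESC minimization): best one-feature threshold rule.

    Returns (feature, threshold, sign); a row is flagged when
    sign * (row[feature] - threshold) >= 0.
    """
    best = None
    for f in range(len(examples[0][0])):
        for thr in sorted({x[f] for x, _ in examples}):
            for sign in (1, -1):
                errors = sum((sign * (x[f] - thr) >= 0) != bool(y) for x, y in examples)
                if best is None or errors < best[0]:
                    best = (errors, f, thr, sign)
    return best[1], best[2], best[3]


def apply_rule(rule, row):
    """True if the rule flags this row as an outlier."""
    f, thr, sign = rule
    return sign * (row[f] - thr) >= 0


def second_round(rule, new_data, n_pos, n_neg):
    """Filter new data with the rule, then rescore the unflagged rest (assumed reading)."""
    flagged = [row for row in new_data if apply_rule(rule, row)]
    rest = [row for row in new_data if not apply_rule(rule, row)]
    scores = score_outliers(rest)
    new_rule = learn_rule(make_labeled_examples(rest, scores, n_pos, n_neg))
    return flagged, new_rule
```

In use, round one would call `score_outliers`, `make_labeled_examples`, and `learn_rule` on the initial dataset, and each later batch would go through `second_round`. The learned rule is a readable condition of the form "feature f at or beyond threshold t", which is the kind of interpretability the abstract attributes to filtering rules.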

Published In

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

August 2001

493 pages

ISBN: 158113391X

DOI: 10.1145/502512

Conference Chair: Doheon Lee (Chonnam National University, Korea)

General Chair: Mario Schkolnick (SGI)

Program Chairs: Foster Provost (New York University), Ramakrishnan Srikant (IBM Almaden Research Center)

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
AAAI: American Association for Artificial Intelligence

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD01: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 26 - 29, 2001

San Francisco, California

Acceptance Rates

KDD '01 Paper Acceptance Rate 31 of 237 submissions, 13%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


