
On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms

Authors: Kenji Yamanishi, Jun-Ichi Takeuchi, Graham Williams, Peter Milne

Data Mining and Knowledge Discovery, Volume 8, Issue 3

Pages 275-300

https://doi.org/10.1023/B:DAMI.0000023676.72185.7c

Published: 01 May 2004

Abstract

Outlier detection is a fundamental issue in data mining, particularly in applications such as fraud detection, network intrusion detection, and network monitoring. SmartSifter is an outlier detection engine that addresses this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (a finite mixture model) of the information source. Each time a datum is input, SmartSifter employs an on-line discounting learning algorithm to update the probabilistic model. The datum is then given a score based on the learned model, with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, at low computational cost. A further experimental application identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
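The score-then-update loop described in the abstract can be illustrated with a small sketch. This is not the SmartSifter algorithm itself: it assumes a single one-dimensional Gaussian instead of a finite mixture, uses the negative log-likelihood as the score, and forgets old data through a hypothetical discounting rate `discount`. The class name `DiscountedGaussian` and all of its parameters are illustrative assumptions and do not come from the paper.

```python
import math


class DiscountedGaussian:
    """Illustrative stand-in for a discounting learner (assumption: one
    1-D Gaussian, not the finite mixture model used by SmartSifter)."""

    def __init__(self, discount=0.05, mean=0.0, var=1.0, var_floor=1e-6):
        self.r = discount                        # hypothetical rate, 0 < r < 1
        self.mean = mean                         # discounted first moment
        self.second_moment = var + mean * mean   # discounted second moment
        self.var_floor = var_floor               # guards against zero variance

    @property
    def var(self):
        return max(self.second_moment - self.mean ** 2, self.var_floor)

    def score(self, x):
        """Negative log-likelihood of x under the current model
        (assumed score; higher means more outlier-like)."""
        return 0.5 * (math.log(2 * math.pi * self.var)
                      + (x - self.mean) ** 2 / self.var)

    def update(self, x):
        """Discounted update: weight r on the new datum, 1 - r on the past."""
        self.mean = (1 - self.r) * self.mean + self.r * x
        self.second_moment = (1 - self.r) * self.second_moment + self.r * x * x

    def process(self, x):
        """Score the datum against the model learned so far, then learn from it."""
        s = self.score(x)
        self.update(x)
        return s


if __name__ == "__main__":
    stream = [0.1, -0.2, 0.05, 0.3, -0.1, 8.0, 0.0]
    model = DiscountedGaussian()
    for x in stream:
        print(f"x={x:5.2f}  score={model.process(x):8.3f}")
```

Because every update multiplies the old statistics by `1 - discount`, the influence of a datum decays geometrically with its age, which is what lets a model of this kind follow a non-stationary source. A larger `discount` adapts faster but makes the estimates noisier.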

Published In

Data Mining and Knowledge Discovery Volume 8, Issue 3

May 2004

96 pages

ISSN:1384-5810

Copyright © 2004 Kluwer Academic Publishers.

Publisher

Kluwer Academic Publishers

United States
