
On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms

Published: 01 May 2004

Abstract

Outlier detection is a fundamental issue in data mining, with applications in fraud detection, network intrusion detection, and network monitoring. SmartSifter is an outlier detection engine that addresses this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (a finite mixture model) of the information source. Each time a datum is input, SmartSifter employs an on-line discounting learning algorithm to update the probabilistic model. A score is given to the datum based on the learned model, with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter identified data with high scores that corresponded to attacks, at low computational cost. A further experimental application identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
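The score-then-update loop described in the abstract can be illustrated with a minimal sketch. This is not SmartSifter itself: it swaps the finite mixture for a single one-dimensional Gaussian, assumes the score is the log loss (negative log density) of each datum under the model learned so far, and uses plain exponential forgetting as the discounting rule. The class name `DiscountedGaussianScorer`, the `discount` parameter, and the toy stream are all hypothetical, chosen only for illustration.

```python
import math


class DiscountedGaussianScorer:
    """Hypothetical single-Gaussian stand-in for a discounting on-line learner."""

    def __init__(self, discount=0.05, mean=0.0, var=1.0, min_var=1e-6):
        self.r = discount                  # forgetting rate in (0, 1); larger forgets faster
        self.mean = mean                   # discounted estimate of E[x]
        self.second = var + mean * mean    # discounted estimate of E[x^2]
        self.min_var = min_var             # floor that keeps the variance positive

    @property
    def var(self):
        return max(self.second - self.mean ** 2, self.min_var)

    def score(self, x):
        """Log loss of x under the current model (assumed scoring rule)."""
        v = self.var
        return 0.5 * math.log(2 * math.pi * v) + (x - self.mean) ** 2 / (2 * v)

    def update(self, x):
        """Exponentially discount old statistics, then blend in the new datum."""
        self.mean = (1 - self.r) * self.mean + self.r * x
        self.second = (1 - self.r) * self.second + self.r * x * x

    def process(self, x):
        """Score the datum against the model learned so far, then learn from it."""
        s = self.score(x)
        self.update(x)
        return s


def demo_scores():
    """Run the scorer over a toy stream whose fifth value is far from the rest."""
    scorer = DiscountedGaussianScorer(discount=0.05)
    stream = [0.1, -0.2, 0.05, 0.0, 8.0, 0.1, -0.1]
    return [scorer.process(x) for x in stream]
```

Calling `demo_scores()` returns one score per datum, and the value 8.0 receives by far the largest one. Because old statistics decay geometrically, the model would track a drifting source rather than averaging over its whole history, which is the intuition behind feature (1) in the abstract.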



Published In

Data Mining and Knowledge Discovery  Volume 8, Issue 3
May 2004
96 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. EM algorithm
  2. anomaly detection
  3. finite mixture model
  4. fraud detection
  5. intrusion detection
  6. outlier detection
