Enhancing Unsupervised Outlier Model Selection: A Study on IREOS Algorithms

Published: 19 June 2024
Abstract

    Outlier detection is a critical cornerstone of data mining, with applications ranging from fraud detection to network security. However, real-world scenarios often lack labeled data for training, necessitating unsupervised outlier detection methods. This study centers on Unsupervised Outlier Model Selection (UOMS), with a specific focus on the family of Internal, Relative Evaluation of Outlier Solutions (IREOS) algorithms. IREOS measures the separability of outlier candidates by evaluating multiple maximum-margin classifiers and, while effective, is constrained by its high computational demands. We investigate the impact of several different separation methods on UOMS in terms of ranking quality and runtime. Surprisingly, our findings indicate that the choice of separability measure has minimal impact on IREOS's effectiveness, whereas using linear separation methods within IREOS significantly reduces its computation time. These insights are significant for real-world applications where efficient outlier detection is critical. Alongside this work, we provide the code for the IREOS algorithm and our separability techniques.
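    As a rough illustration of the separability idea the abstract describes (not the authors' implementation), one can train, for each candidate outlier, a classifier that separates that single point from the rest of the data and read off the score the model assigns to the candidate: points that are easy to separate receive high scores. The sketch below uses a plain linear (logistic) classifier fitted by gradient descent in NumPy; the function names, hyperparameters, and the averaging step are all illustrative assumptions, and IREOS itself adds further machinery (e.g., margin widths and statistical adjustment) not shown here.

    ```python
    import numpy as np

    def separability_score(X, candidate_idx, lr=0.1, epochs=200):
        """Fit a linear (logistic) classifier separating one candidate
        outlier from all remaining points, and return the probability
        the trained model assigns to the candidate.
        Higher values indicate an easier-to-separate (more outlying) point."""
        n, d = X.shape
        y = np.zeros(n)
        y[candidate_idx] = 1.0          # one-vs-rest labeling
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
            w -= lr * (X.T @ (p - y)) / n            # gradient step on weights
            b -= lr * np.mean(p - y)                 # gradient step on bias
        return 1.0 / (1.0 + np.exp(-(X[candidate_idx] @ w + b)))

    def separability_index(X, candidate_indices):
        """Average separability over a set of candidate outliers
        (a crude stand-in for an internal evaluation index)."""
        return float(np.mean([separability_score(X, i) for i in candidate_indices]))

    # Tiny demonstration: one distant point among Gaussian inliers.
    rng = np.random.default_rng(0)
    inliers = rng.normal(0.0, 1.0, size=(50, 2))
    X = np.vstack([inliers, np.array([[8.0, 8.0]])])
    # The distant point (index 50) should separate far more easily
    # than a point drawn from the bulk of the data (index 0).
    print(separability_score(X, 50) > separability_score(X, 0))
    ```

    Because the classifier here is linear, each candidate's score costs only a handful of matrix-vector products per epoch, which is the intuition behind the paper's finding that linear separation methods cut IREOS's runtime without hurting ranking quality.
    
    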


    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 7
    August 2024, 505 pages
    ISSN: 1556-4681
    EISSN: 1556-472X
    DOI: 10.1145/3613689
    Editor: Jian Pei

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 June 2024
    Online AM: 05 April 2024
    Accepted: 19 March 2024
    Revised: 18 March 2024
    Received: 24 October 2023
    Published in TKDD Volume 18, Issue 7


    Author Tags

    1. Outlier detection
    2. anomaly detection
    3. unsupervised evaluation
    4. model selection

    Qualifiers

    • Research-article
