Enhancing Unsupervised Outlier Model Selection: A Study on IREOS Algorithms

Published: 19 June 2024
Abstract

    Outlier detection is a critical cornerstone of data mining, with applications ranging from fraud detection to network security. However, real-world scenarios often lack labeled data for training, necessitating unsupervised outlier detection methods. This study centers on Unsupervised Outlier Model Selection (UOMS), with a specific focus on the family of Internal, Relative Evaluation of Outlier Solutions (IREOS) algorithms. IREOS measures the separability of outlier candidates by evaluating multiple maximum-margin classifiers and, while effective, is constrained by its high computational demands. We investigate the impact of several different separation methods on UOMS in terms of ranking quality and runtime. Surprisingly, our findings indicate that the choice of separability measure has minimal impact on IREOS's effectiveness, whereas using linear separation methods within IREOS significantly reduces its computation time. These insights are significant for real-world applications where efficient outlier detection is critical. Alongside this work, we provide the code for the IREOS algorithm and our separability techniques.
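    As a rough illustration of the separability idea the abstract describes (not the authors' implementation), one can train, for each candidate outlier, a classifier that separates that single point from the rest of the data and read off the score the model assigns to the candidate: points that are easy to separate receive high scores. The sketch below uses a plain linear (logistic) classifier fitted by gradient descent in NumPy; the function names, hyperparameters, and the averaging step are all illustrative assumptions, and IREOS itself adds further machinery (e.g., margin widths and statistical adjustment) not shown here.

    ```python
    import numpy as np

    def separability_score(X, candidate_idx, lr=0.1, epochs=200):
        """Fit a linear (logistic) classifier separating one candidate
        outlier from all remaining points, and return the probability
        the trained model assigns to the candidate.
        Higher values indicate an easier-to-separate (more outlying) point."""
        n, d = X.shape
        y = np.zeros(n)
        y[candidate_idx] = 1.0          # one-vs-rest labeling
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
            w -= lr * (X.T @ (p - y)) / n            # gradient step on weights
            b -= lr * np.mean(p - y)                 # gradient step on bias
        return 1.0 / (1.0 + np.exp(-(X[candidate_idx] @ w + b)))

    def separability_index(X, candidate_indices):
        """Average separability over a set of candidate outliers
        (a crude stand-in for an internal evaluation index)."""
        return float(np.mean([separability_score(X, i) for i in candidate_indices]))

    # Tiny demonstration: one distant point among Gaussian inliers.
    rng = np.random.default_rng(0)
    inliers = rng.normal(0.0, 1.0, size=(50, 2))
    X = np.vstack([inliers, np.array([[8.0, 8.0]])])
    # The distant point (index 50) should separate far more easily
    # than a point drawn from the bulk of the data (index 0).
    print(separability_score(X, 50) > separability_score(X, 0))
    ```

    Because the classifier here is linear, each candidate's score costs only a handful of matrix-vector products per epoch, which is the intuition behind the paper's finding that linear separation methods cut IREOS's runtime without hurting ranking quality.
    
    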


    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 7
    August 2024, 505 pages
    ISSN: 1556-4681
    EISSN: 1556-472X
    DOI: 10.1145/3613689
    Editor: Jian Pei

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 June 2024
    Online AM: 05 April 2024
    Accepted: 19 March 2024
    Revised: 18 March 2024
    Received: 24 October 2023
    Published in TKDD Volume 18, Issue 7


    Author Tags

    1. Outlier detection
    2. anomaly detection
    3. unsupervised evaluation
    4. model selection

    Qualifiers

    • Research-article
