Research article
Open access

AutoOD: Automatic Outlier Detection

Published: 30 May 2023

Abstract

Outlier detection is critical in real-world applications. Because the many existing outlier detection techniques often return different results on the same data set, users face the problem of determining which of these techniques is best suited to their task and of tuning its parameters. This is particularly challenging in the unsupervised setting, where no labels are available for the cross-validation needed for such method selection and parameter optimization. In this work, we propose AutoOD, which uses existing unsupervised detection techniques to automatically produce high-quality outliers without any human tuning. AutoOD's fundamentally new strategy unifies the merits of unsupervised outlier detection and supervised classification within one integrated solution. It automatically tests a diverse set of unsupervised outlier detectors on a target data set and extracts useful signals from their combined detection results to reliably capture key differences between outliers and inliers. It then uses these signals to produce a "custom outlier classifier" whose accuracy is comparable to that of supervised outlier classification models trained with ground-truth labels, without having access to such labels. On a diverse set of benchmark outlier detection datasets, AutoOD consistently outperforms the best unsupervised outlier detector selected from hundreds of candidates. It also outperforms other tuning-free approaches by 12 to 97 points (out of 100) in F-1 score.
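The pipeline sketched in the abstract — run many unsupervised detectors, mine their agreement for reliable pseudo-labels, then train a supervised classifier on those — can be illustrated with a minimal example. The detector set, agreement thresholds, and logistic-regression classifier below are assumptions chosen for illustration; this is not the paper's actual AutoOD algorithm.

```python
# Illustrative sketch only: the detectors, thresholds, and classifier here are
# assumptions, not the AutoOD algorithm from the paper.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(300, 2))
outliers = rng.normal(8.0, 1.5, size=(15, 2))  # small cluster far from the inliers
X = np.vstack([inliers, outliers])

# Step 1: collect binary votes from a diverse set of unsupervised detectors
# with varied hyperparameters (fit_predict returns -1 for flagged points).
votes = []
for k in (10, 20, 30):
    votes.append((LocalOutlierFactor(n_neighbors=k).fit_predict(X) == -1).astype(int))
for c in (0.03, 0.05, 0.1):
    iso = IsolationForest(contamination=c, random_state=0)
    votes.append((iso.fit_predict(X) == -1).astype(int))
agreement = np.mean(votes, axis=0)  # fraction of detectors flagging each point

# Step 2: keep only high-confidence pseudo-labels: points flagged by at least
# half the detectors become pseudo-outliers; points flagged by none, pseudo-inliers.
mask = (agreement >= 0.5) | (agreement == 0.0)
y_pseudo = (agreement[mask] >= 0.5).astype(int)

# Step 3: train the "custom outlier classifier" on the confident subset and
# apply it to every point, including the ambiguous ones the mask excluded.
clf = LogisticRegression().fit(X[mask], y_pseudo)
labels = clf.predict(X)  # 1 = classified as outlier
```

The key idea this captures is that points on which diverse detectors agree act as a stand-in for ground-truth labels, so the final classifier can also decide the ambiguous points on which the detectors disagree.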

Supplemental Material

MP4 File
Video presentation of AutoOD: Automatic Outlier Detection.


Cited By

  • (2024) Parameter-free Streaming Distance-based Outlier Detection. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 102--106. DOI: 10.1109/ICDEW61823.2024.00019. Online publication date: 13-May-2024.


Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 1
PACMMOD, May 2023, 2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 May 2023
Published in PACMMOD Volume 1, Issue 1

Author Tags

  1. automatic tuning
  2. supervised classification
  3. unsupervised outlier detection

Qualifiers

  • Research-article

Funding Sources

  • Department of Education
  • NSF

Article Metrics

  • Downloads (last 12 months): 670
  • Downloads (last 6 weeks): 46
Reflects downloads up to 10 Oct 2024

