Research article
Open access

AutoOD: Automatic Outlier Detection

Published: 30 May 2023

Abstract

Outlier detection is critical in real-world applications. Because the many existing outlier detection techniques often return different results on the same data set, users face the problem of determining which of these techniques is best suited to their task and of tuning its parameters. This is particularly challenging in the unsupervised setting, where no labels are available for the cross-validation needed for such method selection and parameter optimization. In this work, we propose AutoOD, which uses existing unsupervised detection techniques to automatically produce high-quality outliers without any human tuning. AutoOD's fundamentally new strategy unifies the merits of unsupervised outlier detection and supervised classification within one integrated solution. It automatically tests a diverse set of unsupervised outlier detectors on a target data set and extracts useful signals from their combined detection results to reliably capture key differences between outliers and inliers. It then uses these signals to produce a "custom outlier classifier" whose accuracy is comparable to that of supervised outlier classification models trained with ground-truth labels, without having access to such labels. On a diverse set of benchmark outlier detection datasets, AutoOD consistently outperforms the best unsupervised outlier detector selected from hundreds of candidates. It also outperforms other tuning-free approaches by 12 to 97 points (out of 100) in F-1 score.
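The pipeline sketched in the abstract — run many unsupervised detectors, mine their agreement for reliable pseudo-labels, then train a supervised classifier on those — can be illustrated with a minimal example. The detector set, agreement thresholds, and logistic-regression classifier below are assumptions chosen for illustration; this is not the paper's actual AutoOD algorithm.

```python
# Illustrative sketch only: the detectors, thresholds, and classifier here are
# assumptions, not the AutoOD algorithm from the paper.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(300, 2))
outliers = rng.normal(8.0, 1.5, size=(15, 2))  # small cluster far from the inliers
X = np.vstack([inliers, outliers])

# Step 1: collect binary votes from a diverse set of unsupervised detectors
# with varied hyperparameters (fit_predict returns -1 for flagged points).
votes = []
for k in (10, 20, 30):
    votes.append((LocalOutlierFactor(n_neighbors=k).fit_predict(X) == -1).astype(int))
for c in (0.03, 0.05, 0.1):
    iso = IsolationForest(contamination=c, random_state=0)
    votes.append((iso.fit_predict(X) == -1).astype(int))
agreement = np.mean(votes, axis=0)  # fraction of detectors flagging each point

# Step 2: keep only high-confidence pseudo-labels: points flagged by at least
# half the detectors become pseudo-outliers; points flagged by none, pseudo-inliers.
mask = (agreement >= 0.5) | (agreement == 0.0)
y_pseudo = (agreement[mask] >= 0.5).astype(int)

# Step 3: train the "custom outlier classifier" on the confident subset and
# apply it to every point, including the ambiguous ones the mask excluded.
clf = LogisticRegression().fit(X[mask], y_pseudo)
labels = clf.predict(X)  # 1 = classified as outlier
```

The key idea this captures is that points on which diverse detectors agree act as a stand-in for ground-truth labels, so the final classifier can also decide the ambiguous points on which the detectors disagree.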

Supplemental Material

MP4 File
Video presentation of AutoOD: Automatic Outlier Detection.


Cited By

  • (2024) Parameter-free Streaming Distance-based Outlier Detection. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 102--106. DOI: 10.1109/ICDEW61823.2024.00019. Online publication date: 13-May-2024.


Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 1
PACMMOD, May 2023, 2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 May 2023
Published in PACMMOD Volume 1, Issue 1

Author Tags

  1. automatic tuning
  2. supervised classification
  3. unsupervised outlier detection

Qualifiers

  • Research-article

Funding Sources

  • Department of Education
  • NSF

Article Metrics

  • Downloads (last 12 months): 670
  • Downloads (last 6 weeks): 46
Reflects downloads up to 10 Oct 2024

