research-article

Distributed Local Outlier Detection in Big Data

Authors:

Caitlin Kulhman, Elke Rundensteiner

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1225 - 1234

https://doi.org/10.1145/3097983.3098179

Published: 13 August 2017

Abstract

In this work, we present the first distributed solution for the Local Outlier Factor (LOF) method, a popular outlier detection technique shown to be very effective for datasets with skewed distributions. As datasets increase radically in size, highly scalable LOF algorithms leveraging modern distributed infrastructures are required. This poses significant challenges due to the complexity of the LOF definition and the lack of access to the entire dataset at any individual compute machine. Our solution features a distributed LOF pipeline framework, called DLOF. Each stage of the LOF computation is conducted in a fully distributed fashion by leveraging our invariant observation for intermediate value management. Furthermore, we propose a data assignment strategy that ensures each machine is self-sufficient in all stages of the LOF pipeline while minimizing the number of data replicas. Based on the convergence property derived from analyzing this strategy in the context of real-world datasets, we introduce a number of data-driven optimization strategies. These strategies not only minimize the computation costs within each stage but also eliminate unnecessary communication costs by aggressively pushing the LOF computation into the early stages of the DLOF pipeline. Our comprehensive experimental study using both real and synthetic datasets confirms the efficiency and scalability of our approach to terabyte-level data.
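To make the stages of the LOF computation concrete, the following is a minimal single-machine sketch of the standard LOF definition (Breunig et al.), written in Python. It is not the DLOF pipeline described in the abstract: there is no partitioning, replication, or distributed execution here. The function names `knn` and `lof_scores` are hypothetical, and the sketch simplifies the definition by taking exactly k neighbors per point (ties broken by index) and assuming no point has k exact duplicates.

```python
import math


def knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def lof_scores(points, k):
    """Compute an LOF score per point with a brute-force O(n^2) neighbour search."""
    n = len(points)

    # Stage 1: k nearest neighbours and the k-distance of every point.
    neighbours = [knn(points, i, k) for i in range(n)]
    k_dist = [nbrs[-1][0] for nbrs in neighbours]

    # Stage 2: local reachability density (lrd), the inverse of the mean
    # reachability distance from a point to its neighbours.
    # Assumption: the mean is non-zero (no point has k exact duplicates).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # Stage 3: LOF is the mean ratio of the neighbours' lrd to the point's own lrd.
    scores = []
    for i in range(n):
        ratios = [lrd[j] / lrd[i] for _, j in neighbours[i]]
        scores.append(sum(ratios) / len(ratios))
    return scores


if __name__ == "__main__":
    # Four points forming a tight cluster plus one isolated point.
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    for point, score in zip(data, lof_scores(data, k=2)):
        print(point, round(score, 3))
```

Each stage depends on values computed for a point's neighbors in the previous stage (k-distance, then lrd), which is why a distributed version has to manage intermediate values across machines. Scores near 1 indicate a point whose density matches that of its neighbors; in the example, the isolated point receives the highest score.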

Index Terms

Distributed Local Outlier Detection in Big Data
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2017

2240 pages

ISBN: 9781450348874

DOI: 10.1145/3097983

General Chairs: Stan Matwin (Dalhousie University), Shipeng Yu (LinkedIn), Faisal Farooq (IBM)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF

Conference

KDD '17: The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 13 - 17, 2017

Halifax, NS, Canada

Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

Total Citations: 21
Total Downloads: 1,722
Downloads (Last 12 months): 143
Downloads (Last 6 weeks): 24

Reflects downloads up to 12 Feb 2025

Cited By

Salles R, Lange B, Akbarinia R, Masseglia F, Ogasawara E, Pacitti E. (2025). Scalable and accurate online multivariate anomaly detection. Information Systems 131 (102524). Online publication date: Jun-2025. https://doi.org/10.1016/j.is.2025.102524
Horyń C, Nowak-Brzezińska A. (2025). Automatic block size optimization in the LOF algorithm for efficient anomaly detection. Applied Soft Computing 170 (112675). Online publication date: Feb-2025. https://doi.org/10.1016/j.asoc.2024.112675
Mfondoum R, Ivanov A, Koleva P, Poulkov V, Manolova A. (2024). Outlier Detection in Streaming Data for Telecommunications and Industrial Applications: A Survey. Electronics 13:16 (3339). Online publication date: 22-Aug-2024. https://doi.org/10.3390/electronics13163339
Yang X, Zhuang Y, Shi M, Cao X, Chen D, Tang Y. (2024). SPiForest: An Anomaly Detecting Algorithm Using Space Partition Constructed by Probability Density-Based Inverse Sampling. IEEE Transactions on Neural Networks and Learning Systems 35:6 (8013-8025). Online publication date: Jun-2024. https://doi.org/10.1109/TNNLS.2022.3223342
Adesh A, G S, Shetty J, Xu L. (2024). Local outlier factor for anomaly detection in HPCC systems. Journal of Parallel and Distributed Computing 192:C. Online publication date: 1-Oct-2024. https://dl.acm.org/doi/10.1016/j.jpdc.2024.104923
Sereshki M, Zanjireh M, Bahaghighat M. (2023). Textual outlier detection with an unsupervised method using text similarity and density peak. Acta Universitatis Sapientiae, Informatica 15:1 (91-110). Online publication date: 8-Aug-2023. https://doi.org/10.2478/ausi-2023-0008
Badiang R. (2023). Local Outlier Reclassifier (LORec): a Method for Relocating Local Outliers Generated by K-means. 2023 13th International Conference on Software Technology and Engineering (ICSTE) (143-150). Online publication date: 27-Oct-2023. https://doi.org/10.1109/ICSTE61649.2023.00030
Mustafa H, Ayob M. (2022). Enhanced Connectivity Validity Measure Based on Outlier Detection for Multi-Objective Metaheuristic Data Clustering Algorithms. Applied Computational Intelligence and Soft Computing 2022. Online publication date: 1-Jan-2022. https://dl.acm.org/doi/10.1155/2022/1036293
Perepu S, Pinnamaraju V. (2022). A novel unsupervised method for root cause analysis of anomalies using sparse optimization techniques. 2022 10th International Conference on Systems and Control (ICSC) (416-422). Online publication date: 23-Nov-2022. https://doi.org/10.1109/ICSC57768.2022.9993819
Al-Kateb M, Eltabakh M, Al-Omari A, Brown P. (2022). Analytics at Scale: Evolution at Infrastructure and Algorithmic Levels. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (3217-3220). Online publication date: May-2022. https://doi.org/10.1109/ICDE53745.2022.00302
