research-article

RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning

Authors:

Jae-Gil LeeAuthors Info & Claims

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 1173 - 1187

https://doi.org/10.1145/3183713.3196887

Published: 27 May 2018 Publication History

Abstract

In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems including load imbalance between data partitions, especially in a skewed data set. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning, that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary, to supplement random partitions. Then, we develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (shortly, RP-DBSCAN), that uses pseudo random partitioning together with a two-level cell dictionary. The algorithm simultaneously finds the local clusters to each data partition and then merges these local clusters to obtain global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very light, and clustering on each split is not dragged out by a specific worker. Therefore, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms by up to 180 times.

References

[1]

Guilherme Andrade, Gabriel Ramos, Daniel Madeira, Rafael Sachetto, Renato Ferreira, and Leonardo Rocha. 2013. G-DBSCAN: A GPU Accelerated Algorithm for Density-based Clustering. Procedia Computer Science Vol. 18 (2013), 369--378.

[2]

Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. OPTICS: Ordering Points to Identify the Clustering Structure Proc. 1999 ACM SIGMOD Int'l Conf. on Management of Data. 49--60.

Digital Library

[3]

Apache. 2017. Spark API Documentation. https://spark.apache.org/docs/2.1.0/api.html. (2017). Accessed: 2017-10-31.

[4]

Domenica Arlia and Massimo Coppola. 2001. Experiments in Parallel Clustering with DBSCAN. Proc. 7th European Conf. on Parallel Processing. 326--331.

Digital Library

[5]

Marsha J. Berger and Shahid H. Bokhari. 1987. A Partitioning Strategy for Nonuniform Problems on Multiprocessors. IEEE Trans. on Computers Vol. C-36, 5 (1987), 570--580.

Digital Library

[6]

Chun-Chieh Chen and Ming-Syan Chen. 2015. HiClus: Highly Scalable Density-based Clustering with Heterogeneous Cloud. Procedia Computer Science Vol. 53 (2015), 149--157.

[7]

Irving Cordova and Teng-Sheng Moh. 2015. DBSCAN on Resilient Distributed Datasets. In Proc. 2015 Int'l Conf. on High Performance Computing & Simulation. 531--540.

[8]

Bi-Ru Dai and I-Chang Lin. 2012. Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition Proc. 2012 IEEE Int'l Conf. on Cloud Computing. 59--66.

Digital Library

[9]

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters Proc. 6th Sympo. on Operating System Design and Implementation. 137--150.

Digital Library

[10]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise Proc. 2nd Int'l Conf. on Knowledge Discovery and Data Mining. 226--231.

Digital Library

[11]

Junhao Gan and Yufei Tao. 2015. DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation Proc. 2015 ACM SIGMOD Int'l Conf. on Management of Data. 519--530.

Digital Library

[12]

Junhao Gan and Yufei Tao. 2017. Dynamic Density Based Clustering. In Proc. 2017 ACM SIGMOD Int'l Conf. on Management of Data. 1493--1507.

Digital Library

[13]

Markus Götz, Christian Bodenstein, and Morris Riedel. 2015. HPDBSCAN: Highly Parallel DBSCAN. In Proc. Workshop on Machine Learning in High-Performance Computing Environments. 2:1--2:10.

Digital Library

[14]

Poonam Goyal, Sonal Kumari, Dhruv Kumar, Sundar Balasubramaniam, Navneet Goyal, Saiyedul Islam, and Jagat Sesh Challa. 2015. Parallelizing OPTICS for Commodity Clusters. In Proc. 2015 Int'l Conf. on Distributed Computing and Networking. 33.

Digital Library

[15]

Ade Gunawan. 2013. A Faster Algorithm for DBSCAN. Master's thesis. Eindhoven University of Technology, the Netherlands.

[16]

Mordechai Haklay and Patrick Weber. 2008. OpenStreetMap: User-Generated Street Maps.

[17]

Dianwei Han, Ankit Agrawal, Wei-Keng Liao, and Alok Choudhary. 2016. A Novel Scalable DBSCAN Algorithm with Spark. Proc. 2016 IEEE Int'l Sympo. on Parallel and Distributed Processing. 1393--1402.

[18]

Yaobin He, Haoyu Tan, Wuman Luo, Shengzhong Feng, and Jianping Fan. 2014. MR-DBSCAN: A Scalable MapReduce-based DBSCAN Algorithm for Heavily Skewed Data. Frontiers of Computer Science Vol. 8, 1 (2014), 83--99.

Digital Library

[19]

Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. 2004. Scalable Density-Based Distributed Clustering. Proc. 8th European Conf. on Principles of Data Mining and Knowledge Discovery (2004), 231--244.

Digital Library

[20]

George Karypis, Eui-Hong Han, and Vipin Kumar. 1999. Chameleon: Hierarchical Clustering Using Dynamic Modeling. Computer, Vol. 32, 8 (1999), 68--75.

Digital Library

[21]

Dexter C. Kozen. 2012. The Design and Analysis of Algorithms. Springer Science & Business Media.

[22]

YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman. 2010. Scalable Clustering Algorithm for N-body Simulations in a Shared-Nothing Cluster Proc. 22nd Int'l Conf. on Scientific and Statistical Database Management. 132--150.

Digital Library

[23]

Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. 2016. NG-DBSCAN: Scalable Density-Based Clustering for Arbitrary Data. Proceedings of the VLDB Endowment Vol. 10, 3 (2016), 157--168.

Digital Library

[24]

Guangchun Luo, Xiaoyu Luo, Thomas Fairley Gooch, Ling Tian, and Ke Qin. 2016. A Parallel DBSCAN Algorithm Based on Spark. In Proc. 2016 IEEE Int'l Conf. on Big Data and Cloud Computing. 548--553.

[25]

Son T. Mai, Ira Assent, and Martin Storgaard. 2016. AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets. In Proc. 22nd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 1025--1034.

Digital Library

[26]

Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing. Comput. Surveys Vol. 48, 2 (2015), 25.

Digital Library

[27]

Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-Keng Liao, Fredrik Manne, and Alok Choudhary. 2012. A New Scalable Parallel DBSCAN Algorithm Using the Disjoint-Set Data Structure Proc. 2012 Int'l Conf. on High Performance Computing, Networking, Storage and Analysis. 62:1--62:11.

Digital Library

[28]

Mostofa Ali Patwary, Nadathur Satish, Narayanan Sundaram, Fredrik Manne, Salman Habib, and Pradeep Dubey. 2014. PARDICLE: Parallel Approximate Density-based Clustering Proc. 2014 Int'l Conf. on High Performance Computing, Networking, Storage and Analysis. 560--571.

Digital Library

[29]

William M. Rand. 1971. Objective Criteria for the Evaluation of Clustering Methods. J. Amer. Statist. Assoc. Vol. 66, 336 (1971), 846--850.

[30]

Tatsuhiro Sakai, Keiichi Tamura, and Hajime Kitakami. 2017. Cell-Based DBSCAN Algorithm Using Minimum Bounding Rectangle Criteria Proc. 2017 Int'l Conf. on Database Systems for Advanced Applications. 133--144.

[31]

Hwanjun Song, Jae-Gil Lee, and Wook-Shin Han. 2017. PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency Proc. 23rd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 1087--1096.

Digital Library

[32]

Jeffrey Scott Vitter. 1985. Random Sampling with a Reservoir. ACM Trans. on Mathematical Software Vol. 11, 1 (1985), 37--57.

Digital Library

[33]

Larry Wasserman. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science &Business Media.

Digital Library

[34]

Benjamin Welton, Evan Samanas, and Barton P. Miller. 2013. Mr. Scan: Extreme Scale Density-Based Clustering Using a Tree-Based Network of GPGPU Nodes Proc. 2013 Int'l Conf. on High Performance Computing, Networking, Storage and Analysis. 84:1--84:11.

Digital Library

[35]

Xiaowei Xu, Jochen J"ager, and Hans-Peter Kriegel. 1999. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery Vol. 3, 3 (1999), 263--290.

Digital Library

[36]

Yanwei Yu, Jindong Zhao, Xiaodong Wang, Qin Wang, and Yonggang Zhang. 2015. Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop. International Journal of Distributed Sensor Networks Vol. 2015 (2015), 1--13.

Digital Library

[37]

Weizhong Zhao, Huifang Ma, and Qing He. 2009. Parallel K-Means Clustering Based on MapReduce. Proc. 1st Int'l Conf. on Cloud Computing. 674--679.

Digital Library

[38]

Yu Zheng, Like Liu, Longhao Wang, and Xing Xie. 2008. Learning Transportation Mode from Raw GPS Data for Geographic Applications on the Web Proc. 17th Int'l Conf. on World Wide Web. 247--256.

Digital Library

Cited By

Mo GSong SDing H(2024)Towards Metric DBSCAN: Exact, Approximate, and Streaming AlgorithmsProceedings of the ACM on Management of Data10.1145/36549812:3(1-25)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654981
Cesario ELindia PVinci A(2024)A scalable multi-density clustering approach to detect city hotspots in a smart cityFuture Generation Computer Systems10.1016/j.future.2024.03.042157(226-236)Online publication date: Aug-2024
https://doi.org/10.1016/j.future.2024.03.042
Challa JGoyal NSharma ASreekumar NBalasubramaniam SGoyal P(2024)A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering AlgorithmsJournal of Computer Science and Technology10.1007/s11390-024-2700-039:3(610-636)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.1007/s11390-024-2700-0
Show More Cited By

Index Terms

RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

A new hybrid method based on partitioning-based DBSCAN and ant clustering

Clustering problem is an unsupervised learning problem. It is a procedure that partition data objects into matching clusters. The data objects in the same cluster are quite similar to each other and dissimilar in the other clusters. Density-based ...
AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities

Clustering is a typical data mining technique that partitions a dataset into multiple subsets of similar objects according to similarity metrics. In particular, density-based algorithms can find clusters of different shapes and sizes while remaining ...
Merging DBSCAN and Density Peak for Robust Clustering
Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series
Abstract
In data clustering, density based algorithms are well known for the ability of detecting clusters of arbitrary shapes. DBSCAN is a widely used density based clustering approach, and the recently proposed density peak algorithm has shown ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN:9781450347037

DOI:10.1145/3183713

General Chairs:
Gautam Das
University of Texas at Arlington, USA
,
Christopher Jermaine
Rice University, USA
,
Philip Bernstein
Microsoft Research, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Research Foundation of Korea
Ministry of Land, Infrastructure and Transport, Korea

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

TX, Houston, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

45
Total Citations
View Citations
1,226
Total Downloads

Downloads (Last 12 months)71
Downloads (Last 6 weeks)3

Reflects downloads up to 30 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Mo GSong SDing H(2024)Towards Metric DBSCAN: Exact, Approximate, and Streaming AlgorithmsProceedings of the ACM on Management of Data10.1145/36549812:3(1-25)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654981
Cesario ELindia PVinci A(2024)A scalable multi-density clustering approach to detect city hotspots in a smart cityFuture Generation Computer Systems10.1016/j.future.2024.03.042157(226-236)Online publication date: Aug-2024
https://doi.org/10.1016/j.future.2024.03.042
Challa JGoyal NSharma ASreekumar NBalasubramaniam SGoyal P(2024)A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering AlgorithmsJournal of Computer Science and Technology10.1007/s11390-024-2700-039:3(610-636)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.1007/s11390-024-2700-0
An XWang ZWang DLiu SJin CXu XCao J(2023)STRP-DBSCAN: A Parallel DBSCAN Algorithm Based on Spatial-Temporal Random Partitioning for Clustering Trajectory DataApplied Sciences10.3390/app13201112213:20(11122)Online publication date: 10-Oct-2023
https://doi.org/10.3390/app132011122
Mochurad LSydor ARatinskiy O(2023)A fast parallelized DBSCAN algorithm based on OpenMp for detection of criminals on streaming servicesFrontiers in Big Data10.3389/fdata.2023.12929236Online publication date: 31-Oct-2023
https://doi.org/10.3389/fdata.2023.1292923
Amagata DHara T(2023)Efficient Density-Peaks Clustering Algorithms on Static and Dynamic Data in Euclidean SpaceACM Transactions on Knowledge Discovery from Data10.1145/3607873Online publication date: 7-Jul-2023
https://doi.org/10.1145/3607873
Prokopenko ALebrun-Grandie DArndt D(2023)Fast tree-based algorithms for DBSCAN for low-dimensional data on GPUsProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605594(503-512)Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3605573.3605594
Huang XMa T(2023)Fast Density-Based Clustering: Geometric ApproachProceedings of the ACM on Management of Data10.1145/35889121:1(1-24)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588912
Mustafa HBarrus CLeal EGruenwald L(2023)GTraclus: a novel algorithm for local trajectory clustering on GPUsDistributed and Parallel Databases10.1007/s10619-023-07429-x41:3(467-488)Online publication date: 13-May-2023
https://doi.org/10.1007/s10619-023-07429-x
Wang JWang YWang JShen YZhao X(2023)Design of an Image Segmentation System Based on Hierarchical Density-Based Spatial Clustering of Applications with Noise for Off-Road Unmanned Ground VehiclesProceedings of 2022 International Conference on Autonomous Unmanned Systems (ICAUS 2022)10.1007/978-981-99-0479-2_291(3163-3175)Online publication date: 10-Mar-2023
https://doi.org/10.1007/978-981-99-0479-2_291
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents