Towards Metric DBSCAN: Exact, Approximate, and Streaming Algorithms

Authors:

Hu Ding

Proceedings of the ACM on Management of Data, Volume 2, Issue 3

Article No.: 178, Pages 1 - 25

https://doi.org/10.1145/3654981

Published: 30 May 2024

Abstract

DBSCAN is a popular density-based clustering algorithm with many practical applications. However, its running time in high-dimensional space or in a general metric space (e.g., clustering a set of texts using edit distance) can be as large as quadratic in the input size. Moreover, most existing acceleration techniques for DBSCAN are available only for low-dimensional Euclidean space. In this paper, we study the DBSCAN problem under the assumption that the inliers (the core points and border points) have a low intrinsic dimension, which is a realistic assumption for many high-dimensional applications, while the outliers can be located anywhere in the space without any assumption. First, we propose a k-center clustering based algorithm that reduces the time-consuming labeling and merging tasks of DBSCAN to linear time. Further, we propose a linear-time approximate DBSCAN algorithm, whose key idea is to build a novel small-size summary of the core points. Our algorithm can also be implemented efficiently for streaming data, and the required memory is independent of the input size. Finally, we conduct experiments comparing our algorithms with several popular DBSCAN algorithms. The experimental results suggest that our proposed approach can significantly reduce the computational complexity in practice.
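
To make the quadratic baseline mentioned in the abstract concrete, the following is a minimal sketch of textbook DBSCAN in a general metric space, using edit distance on strings. It is not one of the algorithms proposed in the paper. The function names `edit_distance` and `naive_metric_dbscan`, the parameter names `eps` and `min_pts`, and the toy word list are assumptions chosen for illustration only.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
NOISE = -1  # label for points that belong to no cluster


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def naive_metric_dbscan(points: Sequence[T],
                        dist: Callable[[T, T], float],
                        eps: float,
                        min_pts: int) -> List[int]:
    """Textbook DBSCAN that only needs a distance function (illustrative)."""
    n = len(points)
    # All-pairs neighborhood computation: n * n distance evaluations.
    # This is the quadratic cost that arises without a spatial index.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # A point is a core point if its eps-ball (itself included) is dense.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster from an unlabeled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    if is_core[q]:  # border points are labeled, not expanded
                        stack.append(q)
        cluster += 1
    return labels


if __name__ == "__main__":
    words = ["cat", "bat", "rat", "hat", "dog", "dot", "dig", "xylophone"]
    print(naive_metric_dbscan(words, edit_distance, eps=1, min_pts=3))
```

On this toy input the four rhyming words form one cluster, "dog" is a core point with "dot" and "dig" as border points, and "xylophone" is left as an outlier. Every point is compared against every other point, which is the cost the abstract describes as quadratic in the input size.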

Index Terms

Towards Metric DBSCAN: Exact, Approximate, and Streaming Algorithms
1. Theory of computation
  1. Theory and algorithms for application domains

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 3

SIGMOD

June 2024

1953 pages

EISSN:2836-6573

DOI:10.1145/3670010

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States
