Towards Metric DBSCAN: Exact, Approximate, and Streaming Algorithms

Published: 30 May 2024

Abstract

DBSCAN is a popular density-based clustering algorithm with many practical applications. However, its running time in high-dimensional space or a general metric space (e.g., clustering a set of texts using edit distance) can be as large as quadratic in the input size. Moreover, most existing acceleration techniques for DBSCAN are only available for low-dimensional Euclidean space. In this paper, we study the DBSCAN problem under the assumption that the inliers (the core points and border points) have a low intrinsic dimension (a realistic assumption for many high-dimensional applications), while the outliers can be located anywhere in the space without any assumption. First, we propose a k-center clustering based algorithm that reduces the time-consuming labeling and merging tasks of DBSCAN to linear time. Further, we propose a linear-time approximate DBSCAN algorithm, whose key idea is to build a novel small-size summary of the core points. Our algorithm can also be implemented efficiently for streaming data, and the required memory is independent of the input size. Finally, we conduct experiments comparing our algorithms with several popular DBSCAN algorithms. The experimental results suggest that our proposed approach can significantly reduce the computational complexity in practice.
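
To make the quadratic cost mentioned above concrete, the following is a minimal sketch of textbook DBSCAN in a general metric space, using edit distance on strings. It is the brute-force baseline, not the algorithm proposed in the paper. The names `edit_distance`, `metric_dbscan`, `eps`, and `min_pts` are illustrative assumptions, and a point is assumed to count itself among its neighbors.

```python
from collections import deque


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def metric_dbscan(points, dist, eps, min_pts):
    """Brute-force DBSCAN that only needs a distance function.

    Returns (labels, n_dist_calls); label -1 marks an outlier.
    Assumption: a point counts itself toward min_pts.
    """
    n = len(points)
    calls = 0
    # Neighborhoods by exhaustive search: one distance call per unordered pair,
    # i.e. n * (n - 1) / 2 calls, which is quadratic in the input size.
    neighbors = [[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            calls += 1
            if dist(points[i], points[j]) <= eps:
                neighbors[i].append(j)
                neighbors[j].append(i)

    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for s in range(n):
        if not is_core[s] or labels[s] != -1:
            continue
        # Grow a cluster from an unlabeled core point by breadth-first search.
        labels[s] = cluster
        queue = deque([s])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:  # only core points expand the cluster
                        queue.append(q)
        cluster += 1
    return labels, calls
```

For example, `metric_dbscan(words, edit_distance, eps=2, min_pts=3)` on nine short words performs 36 distance evaluations, one per pair. With an expensive metric such as edit distance, this pairwise step is the part that dominates the running time.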

    Published In

    Proceedings of the ACM on Management of Data, Volume 2, Issue 3
    SIGMOD
    June 2024
    1953 pages
    EISSN: 2836-6573
    DOI: 10.1145/3670010

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2024
    Published in PACMMOD Volume 2, Issue 3

    Author Tags

    1. approximation
    2. density-based clustering
    3. doubling dimension
    4. k-center clustering
    5. outliers
    6. streaming

    Qualifiers

    • Research-article
