research-article

A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

Authors: Nizar Bouguila, HaiLin Li

Volume 83, Issue C

Pages 375 - 387

https://doi.org/10.1016/j.patcog.2018.05.030

Published: 01 November 2018

Highlights

• The underlying idea is that two points p and q that are close to each other should have similar neighbors: for a given eps, the closer they are, the more similar their neighbors are.

• NQ-DBSCAN is an exact algorithm that may return the same result as DBSCAN if the parameters are the same, whereas ρ-Approximate DBSCAN is an approximate algorithm.

• The best-case complexity of NQ-DBSCAN can be O(n), and its average complexity is proved to be O(n log(n)) provided the parameters are properly chosen, whereas ρ-Approximate DBSCAN runs only in O(n²) in high dimensions.

• NQ-DBSCAN is suitable for clustering data with a lot of noise.
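
The first highlight can be made precise with the triangle inequality. The statement below is a standard consequence of working in a metric space, not a formula quoted from the paper; the notation N_r(x) for the closed ball of radius r around x is an assumption introduced here.

```latex
% Assumed notation (not from the source): d is a metric and
% N_r(x) = \{\, y : d(x, y) \le r \,\} is the r-neighborhood of x.
\[
d(p, q) = r \le \varepsilon
\quad\Longrightarrow\quad
N_{\varepsilon - r}(p) \;\subseteq\; N_{\varepsilon}(q) \;\subseteq\; N_{\varepsilon + r}(p)
\]
% Right inclusion: if d(q, y) \le \varepsilon, then
%   d(p, y) \le d(p, q) + d(q, y) \le r + \varepsilon.
% Left inclusion: if d(p, y) \le \varepsilon - r, then
%   d(q, y) \le d(q, p) + d(p, y) \le r + (\varepsilon - r) = \varepsilon.
```

As r shrinks towards 0, both bounds approach the eps-neighborhood of p, which is one way to read "the closer they are, the more similar their neighbors are".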

Abstract

Clustering is an important technique for dealing with the large-scale data that are being created explosively on the Internet. Most of these data are high-dimensional and contain a lot of noise, which poses great challenges for retrieval, classification and understanding. No existing approach is “optimal” for large-scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time only if the dimension D is a relatively small constant. We show, theoretically and experimentally, that ρ-Approximate DBSCAN degenerates to an O(n²) algorithm in very high dimensions, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to DBSCAN; the resulting algorithm, named NQ-DBSCAN, effectively reduces a large number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log(n)) time on average with the help of an indexing technique, and in O(n) time in the best case if proper parameters are used, which makes it suitable for many real-time data.
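
The idea of pruning distance computations through a local neighborhood search can be illustrated with a minimal sketch. This is not the paper's NQ-DBSCAN procedure: it only shows how, by the triangle inequality, the eps-neighborhoods of the points near p can be found among the candidates within 2*eps of p rather than among all n points. The function name `local_region_queries`, the Euclidean metric, and the brute-force first scan (in place of an index) are assumptions made for illustration.

```python
import numpy as np


def local_region_queries(X, p, eps):
    """Eps-neighborhoods of p and of each of its eps-neighbors (hypothetical helper).

    Returns (neighbors, n_dist): neighbors maps a point index to a sorted array
    of the indices within eps of it; n_dist counts distance evaluations.
    """
    # One full scan from p: n distance evaluations (no index is assumed here).
    d_p = np.linalg.norm(X - X[p], axis=1)
    n_dist = len(X)

    # By the triangle inequality, any point within eps of a neighbor q of p
    # lies within 2*eps of p, so this set is a sufficient candidate pool.
    candidates = np.flatnonzero(d_p <= 2 * eps)

    neighbors = {p: np.flatnonzero(d_p <= eps)}
    for q in neighbors[p]:
        q = int(q)
        if q == p:
            continue
        # Only the candidates are compared against q, not all n points.
        d_q = np.linalg.norm(X[candidates] - X[q], axis=1)
        n_dist += len(candidates)
        neighbors[q] = candidates[d_q <= eps]
    return neighbors, n_dist
```

Calling `local_region_queries(X, 0, eps)` on an `(n, D)` array returns the same neighborhoods as a full scan per point would, while the number of distance evaluations depends on how many points fall within 2*eps of p; the saving is therefore data- and parameter-dependent.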

Index Terms

A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data
1. Computing methodologies

Index terms have been assigned to the content through auto-classification.

Published In

Pattern Recognition Volume 83, Issue C

Nov 2018

511 pages

ISSN:0031-3203

Elsevier Ltd.

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 November 2018

Cited By

Sadhukhan P, Halder L, Palit S (2024) Approximate DBSCAN on obfuscated data. Journal of Information Security and Applications 80:C. Online publication date: 17-Apr-2024
https://dl.acm.org/doi/10.1016/j.jisa.2023.103664

Wang H, Huang X, Wu Y (2024) GD3N. Information Sciences: an International Journal 665:C. Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1016/j.ins.2024.120375

Xu X, Hou H, Ding S (2024) Semi-supervised deep density clustering. Applied Soft Computing 148:C. Online publication date: 27-Feb-2024
https://dl.acm.org/doi/10.1016/j.asoc.2023.110903

Tian M, Liang J, Zhang D, Zhang X, Wang Z, Li H (2023) Detection of Financial Fraudulent Activities with Machine Learning: A Case Study of Detecting Potential Tax and Invoice Fraud. Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence (33-39). Online publication date: 8-Dec-2023
https://dl.acm.org/doi/10.1145/3638584.3638669

Huang X, Ma T (2023) Fast Density-Based Clustering: Geometric Approach. Proceedings of the ACM on Management of Data 1:1 (1-24). Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588912

Guan J, Li S, He X, Zhu J, Chen J, Si P (2023) SMMP: A Stable-Membership-Based Auto-Tuning Multi-Peak Clustering Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:5 (6307-6319). Online publication date: 1-May-2023
https://dl.acm.org/doi/10.1109/TPAMI.2022.3213574

Guan J, Li S, Chen X, He X, Chen J (2023) DEMOS: Clustering by Pruning a Density-Boosting Cluster Tree of Density Mounts. IEEE Transactions on Knowledge and Data Engineering 35:10 (10814-10830). Online publication date: 1-Oct-2023
https://dl.acm.org/doi/10.1109/TKDE.2023.3266451

Huang X, Ma T, Liu C, Liu S (2023) GriT-DBSCAN. Pattern Recognition 142:C. Online publication date: 1-Oct-2023
https://dl.acm.org/doi/10.1016/j.patcog.2023.109658

Ding S, Li C, Xu X, Ding L, Zhang J, Guo L, Shi T (2023) A Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data. Pattern Recognition 136:C. Online publication date: 1-Apr-2023
https://dl.acm.org/doi/10.1016/j.patcog.2022.109238

Wei X, Peng M, Huang H, Zhou Y (2023) An overview on density peaks clustering. Neurocomputing 554:C. Online publication date: 14-Oct-2023
https://dl.acm.org/doi/10.1016/j.neucom.2023.126633
