Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

Published: 01 November 2018 Publication History

Highlights

The underlying idea is: point p and point q should have similar neighbors, provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.
NQ-DBSCAN is an exact algorithm that may return the same result as DBSCAN if the parameters are same. While ρ-Approximate DBSCAN is an approximate algorithm.
The best complexity of NQ-DBSCAN can be O(n), and the average complexity of NQ-DBSCAN is proved to be O(n log(n)) provided the parameters are properly chosen. While ρ-Approximate DBSCAN runs only in O(n2) in high dimension.
NQ-DBSCAN is suitable for clustering data with a lot of noise.

Abstract

Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is “optimal” for large scale data. For example, DBSCAN requires O(n 2) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n 2) algorithm in very high dimension such that 2 D  >  > n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data.

References

[1]
J. Song, L. Gao, F. Nie, H.T. Shen, Y. Yan, N. Sebe, Optimized graph learning using partial tags and multiple features for image and video annotation, IEEE Trans. Image Process. 25 (11) (2016) 4999–5011.
[2]
J. Song, L. Gao, F. Zou, Y. Yan, N. Sebe, Deep and fast: deep learning hashing with semi-supervised graph construction, Image Vision Comput. (2016).
[3]
J. Song, H.T. Shen, J. Wang, Z. Huang, N. Sebe, J. Wang, A distance-computation-free search scheme for binary code databases, IEEE Trans. Multimedia 18 (3) (2016) 484–495.
[4]
W. Zhou, M. Yang, X. Wang, H. Li, Y. Lin, Q. Tian, Scalable feature matching by dual cascaded scalar quantization for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 159–171.
[5]
S. Zhang, Q. Huang, S. Jiang, W. Gao, Q. Tian, Affective visualization and retrieval for music video, IEEE Trans. Multimedia 12 (6) (2010) 510–522.
[6]
A.K. Rajagopal, R. Subramanian, E. Ricci, R.L. Vieriu, O. Lanz, N. Sebe, et al., Exploring transfer learning approaches for head pose classification from multi-view surveillance images, Int. J. Comput. Vision 109 (1-2) (2014) 146–167.
[7]
Y. Yan, E. Ricci, R. Subramanian, G. Liu, O. Lanz, N. Sebe, A multi-task learning framework for head pose estimation under target motion, IEEE Trans. Pattern Anal. Mach. Intell. 38 (6) (2016) 1070–1083.
[8]
B.F. Qaqish, J.J. OBrien, J.C. Hibbard, K.J. Clowers, Accelerating high dimensional clustering with lossless data reduction, Bioinformatics (2017) btx328.
[9]
Z. Deng, K.-S. Choi, Y. Jiang, J. Wang, S. Wang, A survey on soft subspace clustering, Inf. Sci. 348 (2016) 84–106.
[10]
O. Limwattanapibool, S. Arch-int, Determination of the appropriate parameters for k-means clustering using selection of region clusters based on density dbscan (srcd-dbscan), Expert Syst. (2017).
[11]
L. Bai, X. Cheng, J. Liang, H. Shen, Y. Guo, Fast density clustering strategies based on the k-means algorithm, Pattern Recognit. 71 (2017) 375–386.
[12]
N.A. Yousri, M.S. Kamel, M.A. Ismail, A distance-relatedness dynamic model for clustering high dimensional data of arbitrary shapes and densities, Pattern Recognit. 42 (7) (2009) 1193–1209.
[13]
C. Zhong, D. Miao, R. Wang, A graph-theoretical clustering method based on two rounds of minimum spanning trees, Pattern Recognit. 43 (3) (2010) 752–766.
[14]
M. Ester, H.-P. Kriegel, J. Sander, Algorithms and applications for spatial data mining, Geographic Data Min. Knowl. Discovery 5 (6) (2001).
[15]
W.A. Barbakh, Y. Wu, C. Fyfe, Non-Standard Parameter Adaptation for Exploratory Data Analysis, 249, Springer, 2009.
[16]
J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2011.
[17]
J. Hou, W. Liu, E. Xu, H. Cui, Towards parameter-independent data clustering and image segmentation, Pattern Recognit. 60 (2016) 25–36.
[18]
S. Mitra, P.P. Kundu, Satellite image segmentation with shadowed c -means, Inf. Sci. 181 (17) (2011) 3601–3613.
[19]
S. Das, S. Sil, Kernel-induced fuzzy clustering of image pixels with an improved differential evolution algorithm, Inf. Sci. Int. J. 180 (8) (2010) 1237–1256.
[20]
Y.C. Song, H.D. Meng, M.J. OGrady, G.M.P. OHare, The application of cluster analysis in geophysical data interpretation, Comput. Geosci. 14 (2) (2010) 263–271.
[21]
A. Ghosh, N.S. Mishra, S. Ghosh, Fuzzy clustering algorithms for unsupervised change detection in remote sensing images, Inf. Sci. 181 (4) (2011) 699–715.
[22]
Y.J. Wang, H.S. Lee, A clustering method to identify representative financial ratios, Inf. Sci. 178 (4) (2008) 1087–1097.
[23]
J. Li, K. Wang, L. Xu, Chameleon based on clustering feature tree and its application in customer segmentation, Ann. Oper. Res. 168 (1) (2009) 225–245.
[24]
Q. Bsoul, J. Salim, L.Q. Zakaria, An intelligent document clustering approach to detect crime patterns, Procedia Technol. 11 (1) (2013) 1181–1187.
[25]
C.-W. Huang, K.-P. Lin, M.-C. Wu, K.-C. Hung, G.-S. Liu, C.-H. Jen, Intuitionistic fuzzy c-means clustering algorithm with neighborhood attraction in segmenting medical image, Soft Comput. 19 (2) (2015) 459–470.
[26]
V.P. Ananthi, P. Balasubramaniam, T. Kalaiselvi, A new fuzzy clustering algorithm for the segmentation of brain tumor, Soft Comput. (2015) 1–21.
[27]
R. Chinchuluun, W.S. Lee, J. Bhorania, P.M. Pardalos, Clustering and Classification Algorithms in Food and Agricultural Applications: A Survey, Springer US, 2009.
[28]
A. Hatamlou, Black hole: a new heuristic optimization approach for data clustering, Inf. Sci. 222 (3) (2013) 175–184.
[29]
J.G. Lee, J. Han, K.Y. Whang, Trajectory clustering: a partition-and-group framework, ACM SIGMOD International Conference on Management of Data, 2007, pp. 593–604.
[30]
X.T. Yuan, B.G. Hu, R. He, Agglomerative mean-shift clustering, IEEE Trans. Knowl. Data Eng. 24 (2) (2012) 209–219.
[31]
W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, E.Y. Chang, Parallel spectral clustering in distributed systems, IEEE Trans. Pattern Anal. Mach. Intell. 33 (3) (2011) 568–586.
[32]
S. Mitra, A parallel clustering technique for the vehicle routing problem with split deliveries and pickups, J. Oper. Res. Soc. 59 (11) (2008) 1532–1546.
[33]
M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise., Kdd, 96, 1996, pp. 226–231.
[34]
B. Borah, D. Bhattacharyya, An improved sampling-based dbscan for large spatial databases, Intelligent Sensing and Information Processing, 2004. Proceedings of International Conference on, IEEE, 2004, pp. 92–96.
[35]
C. Ruiz, M. Spiliopoulou, E. Menasalvas, C-dbscan: density-based clustering with constraints, International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing, Springer, 2007, pp. 216–223.
[36]
Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, J. Fan, Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce, Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, IEEE, 2011, pp. 473–480.
[37]
K.M. Kumar, A.R.M. Reddy, A fast dbscan clustering algorithm by accelerating neighbor searching using groups method, Pattern Recognit. 58 (2016) 39–48.
[38]
A. Tramacere, C. Vecchio, γ-ray dbscan: a clustering algorithm applied to fermi-lat γ-ray data-i. detection performances with real and simulated data, Astron. Astrophys. 549 (2013) A138.
[39]
S.T. Mai, S. Goebl, C. Plant, A similarity model and segmentation algorithm for white matter fiber tracts, 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012, pp. 1014–1019.
[40]
J. Gan, Y. Tao, Dbscan revisited: mis-claim, un-fixability, and approximation, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM, 2015, pp. 519–530.
[41]
J. Wang, Z. Deng, K.-S. Choi, Y. Jiang, X. Luo, F.-L. Chung, S. Wang, Distance metric learning for soft subspace clustering in composite kernel space, Pattern Recognit. 52 (2016) 113–134.
[42]
P. Qian, Y. Jiang, Z. Deng, L. Hu, S. Sun, S. Wang, R.F. Muzic, Cluster prototypes and fuzzy memberships jointly leveraged cross-domain maximum entropy clustering, IEEE Trans. Cybern. 46 (1) (2016) 181–193.
[43]
Z. Deng, Y. Jiang, F.-L. Chung, H. Ishibuchi, K.-S. Choi, S. Wang, Transfer prototype-based fuzzy clustering, IEEE Trans. Fuzzy Syst. 24 (5) (2016) 1210–1232.
[44]
Z. Deng, K.-S. Choi, F.-L. Chung, S. Wang, Enhanced soft subspace clustering integrating within-cluster and between-cluster information, Pattern Recognit. 43 (3) (2010) 767–781.
[45]
A. Gunawan, A Faster Algorithm for DBSCAN, Ph.D. thesis, Masters thesis, Technische University Eindhoven, 2013.
[46]
P. Viswanath, V.S. Babu, Rough-dbscan: a fast hybrid density based clustering method for large data sets, Pattern Recognit. Lett. 30 (16) (2009) 1477–1488.
[47]
D. Birant, A. Kut, St-dbscan: an algorithm for clustering spatial–temporal data, Data Knowl. Eng. 60 (1) (2007) 208–221.
[48]
S. Mahran, K. Mahar, Using grid for accelerating density-based clustering, IEEE International Conference on Computer and Information Technology, 2008, pp. 35–40.
[49]
C. Xiaoyun, M. Yufang, Z. Yan, W. Ping, Gmdbscan: multi-density dbscan cluster based on grid, IEEE International Conference on E-Business Engineering, 2008, pp. 780–783.
[50]
O. Uncu, W.A. Gruver, D.B. Kotak, D. Sabaz, Gridbscan: grid density-based spatial clustering of applications with noise, IEEE International Conference on Systems, Man and Cybernetics, 2006, pp. 2976–2981.
[51]
L. Zhang, Z. Xu, F. Si, Gcmddbscan: multi-density dbscan based on grid and contribution, IEEE International Conference on Dependable, Autonomic and Secure Computing, 2013, pp. 502–507.
[52]
H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 117–128.
[53]
A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Commun. ACM 51 (1) (2008) 117C122.
[54]
D.G.L. Marius Muja, Scalable nearest neighbor algorithms for high dimensional data, IEEE Trans. Pattern Anal. Mach. Intell. 36 (11) (2014) 2227–2240.
[55]
K. Buza, Feedback prediction for blogs, Data Analysis, Machine Learning and Knowledge Discovery, Springer, 2014, pp. 145–152.
[56]
K. Ricanek, T. Tesafaye, Morph: a longitudinal image database of normal adult age-progression, International Conference on Automatic Face and Gesture Recognition, 2006, pp. 341–345.
[57]
G. Karypis, E.-H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, Computer 32 (8) (1999) 68–75.
[58]
A. Gionis, H. Mannila, P. Tsaparas, Clustering aggregation, ACM Trans. Knowl. Discovery Data (TKDD) 1 (1) (2007) 4.
[59]
S. Maurus, C. Plant, Skinny-dip: clustering in a sea of noise, SIGKDD, ACM, 2016, pp. 1055–1064.
[60]
P.J. Huber, Robust statistics, International Encyclopedia of Statistical Science, Springer, 2011, pp. 1248–1251.
[61]
Y. Chen, S. Tang, L. Zhou, C. Wang, J. Du, T. Wang, S. Pei, Decentralized clustering by finding loose and distributed density cores, Inf. Sci. 433-434 (2018) 510–526.
[62]
Y. Chen, S. Tang, S. Pei, C. Wang, J. Du, N. Xiong, Dheat: a density heat-based algorithm for clustering with effective radius, IEEE Trans. Syst. Man. Cybern. 48 (2018) 649–660.
[63]
H.W. Kuhn, The hungarian method for the assignment problem, Naval Res. Logist. Q. 2 (1-2) (1955) 83–97.

Cited By

View all

Index Terms

  1. A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Pattern Recognition
    Pattern Recognition  Volume 83, Issue C
    Nov 2018
    511 pages

    Publisher

    Elsevier Science Inc.

    United States

    Publication History

    Published: 01 November 2018

    Author Tags

    1. DBSCAN
    2. ρ-Approximate DBSCAN
    3. NQ-DBSCAN

    Author Tags

    1. 00-01
    2. 99-00

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 01 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all

    View Options

    View options

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media