DOI: 10.1109/SC.2014.51

Pardicle: parallel approximate density-based clustering

Published: 16 November 2014

Abstract

DBSCAN is a widely used isodensity-based clustering algorithm for particle data, well known for its ability to isolate arbitrarily shaped clusters and to filter out noise. The algorithm is super-linear (O(n log n)) and computationally expensive for large datasets. Given the need for speed, we propose a fast heuristic algorithm for DBSCAN based on density-based sampling, which achieves nearly the same quality as exact algorithms but is more than an order of magnitude faster. Our experiments on massive astrophysics and synthetic datasets (8.5 billion numbers) show that our approximate algorithm is up to 56x faster than exact algorithms with almost identical quality (Omega-Index ≥ 0.99). We also develop a new parallel DBSCAN algorithm that uses dynamic partitioning to improve load balancing and locality. We demonstrate near-linear speedup on shared-memory (15x using 16 cores of a single-node Intel® Xeon® processor) and distributed-memory (3917x using 4096 cores across multiple nodes) computers, with an additional 2x performance improvement using Intel® Xeon Phi coprocessors. Additionally, existing exact algorithms can achieve up to a 3.4x speedup using dynamic partitioning.
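
For readers unfamiliar with the baseline, the following is a minimal sketch of exact DBSCAN built on a disjoint-set (union-find) structure, the data structure named in the author tags. It is an illustration only, not the authors' implementation: the names `DisjointSet` and `dbscan` are hypothetical, the neighbor search is brute force (O(n^2)) rather than tree-based, and neither the density-based sampling heuristic nor the dynamic partitioning described in the abstract is reproduced.

```python
import math


class DisjointSet:
    """Union-find with path halving and union by size (illustrative helper)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Assumption: brute-force eps-neighborhoods; a spatial index would replace this.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A point is core if its neighborhood (which includes itself) has >= min_pts points.
    core = [len(nb) >= min_pts for nb in neighbors]

    ds = DisjointSet(n)
    assigned = list(core)  # core points always belong to some cluster
    for i in range(n):
        if not core[i]:
            continue
        for j in neighbors[i]:
            if core[j]:
                ds.union(i, j)  # core-core: merge the two clusters
            elif not assigned[j]:
                ds.union(i, j)  # border point joins the first core point reaching it
                assigned[j] = True

    labels, ids = [], {}
    for i in range(n):
        if not assigned[i]:
            labels.append(-1)  # neither core nor reachable from a core point
        else:
            labels.append(ids.setdefault(ds.find(i), len(ids)))
    return labels
```

As a usage example, `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_pts=3)` returns `[0, 0, 0, 1, 1, 1, -1]`: two clusters of three points each and one noise point. Expressing cluster merging as union operations is what makes this formulation attractive for parallel variants, since unions can be applied in any order.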

Published In

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2014
1054 pages
ISBN: 9781479955008
  • General Chair: Trish Damkroger
  • Program Chair: Jack Dongarra

Publisher

IEEE Press

Author Tags

  1. approximate clustering algorithm
  2. density based clustering
  3. disjoint-set data structure
  4. union-find algorithm

Qualifiers

  • Research-article

Conference

SC '14

Acceptance Rates

SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2024) A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms. Journal of Computer Science and Technology 39:3 (610-636). DOI: 10.1007/s11390-024-2700-0. Online publication date: 1-May-2024
  • (2023) Fast tree-based algorithms for DBSCAN for low-dimensional data on GPUs. Proceedings of the 52nd International Conference on Parallel Processing (503-512). DOI: 10.1145/3605573.3605594. Online publication date: 7-Aug-2023
  • (2019) Hybrid CPU/GPU clustering in shared memory on the billion point scale. Proceedings of the ACM International Conference on Supercomputing (35-45). DOI: 10.1145/3330345.3330349. Online publication date: 26-Jun-2019
  • (2018) A Label Propagation Algorithm Based on Local Density of Data Points. Proceedings of the 2nd International Conference on Algorithms, Computing and Systems (61-64). DOI: 10.1145/3242840.3242872. Online publication date: 27-Jul-2018
  • (2018) RP-DBSCAN. Proceedings of the 2018 International Conference on Management of Data (1173-1187). DOI: 10.1145/3183713.3196887. Online publication date: 27-May-2018
  • (2017) Exact, Fast and Scalable Parallel DBSCAN for Commodity Platforms. Proceedings of the 18th International Conference on Distributed Computing and Networking (1-10). DOI: 10.1145/3007748.3007773. Online publication date: 5-Jan-2017
  • (2016) NG-DBSCAN. Proceedings of the VLDB Endowment 10:3 (157-168). DOI: 10.14778/3021924.3021932. Online publication date: 1-Nov-2016
  • (2015) BD-CATS. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.1145/2807591.2807616. Online publication date: 15-Nov-2015
