
G2P: A Partitioning Approach for Processing DBSCAN with MapReduce

  • Conference paper
Web and Wireless Geographical Information Systems (W2GIS 2015)

Abstract

One of the most important aspects to consider when processing large data sets is how to distribute and parallelize the analysis algorithms. A distributed system performs well only if the workload is properly balanced, because the overall computing time is expected to be determined by the node on which processing takes longest. This paper proposes a data partitioning strategy that takes partition balance into account and that is generic for spatial data. Our solution is based on a grid data structure that is then transformed into a graph partitioning problem, from which we finally compute the partitions. We apply the approach to the distributed DBSCAN algorithm, which finds dense areas in a large data set using MapReduce. We call our approach G2P (Grid and Graph Partitioning), and we show through extensive experiments that G2P yields high-quality data partitioning for the distributed DBSCAN algorithm compared to competing approaches. We believe that G2P is suitable not only for the DBSCAN algorithm but also for spatial join operations and distance-based range queries, to name a few.
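The pipeline summarized in the abstract (grid, then graph, then balanced partitions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 8-neighbour adjacency rule, and the greedy heuristic are all assumptions made for illustration, and a dedicated graph partitioner would normally replace the greedy step. The sketch weights each grid cell by its point count, links adjacent non-empty cells, and assigns cells to k partitions while tracking the heaviest load, which is the quantity that bounds the running time of the slowest node.

```python
from collections import Counter

def build_grid(points, cell_size):
    """Count points per grid cell; a cell is keyed by its integer (col, row)."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

def grid_to_graph(counts):
    """Vertices are non-empty cells; edges link cells that touch (8-neighbourhood).
    The adjacency rule is an assumption, not taken from the paper."""
    graph = {cell: set() for cell in counts}
    for (cx, cy) in counts:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nb = (cx + dx, cy + dy)
                if nb != (cx, cy) and nb in counts:
                    graph[(cx, cy)].add(nb)
    return graph

def greedy_partition(counts, graph, k):
    """Hypothetical stand-in for a real graph partitioner.
    Cells are placed heaviest first. Among partitions that stay within the
    ideal load, the one already holding most neighbours wins; if none fits,
    the lightest partition is used."""
    target = sum(counts.values()) / k
    loads = [0] * k
    assignment = {}
    for cell in sorted(counts, key=lambda c: (-counts[c], c)):
        affinity = [0] * k
        for nb in graph[cell]:
            if nb in assignment:
                affinity[assignment[nb]] += 1
        fits = [p for p in range(k) if loads[p] + counts[cell] <= target]
        if fits:
            best = min(fits, key=lambda p: (-affinity[p], loads[p], p))
        else:
            best = min(range(k), key=lambda p: (loads[p], p))
        assignment[cell] = best
        loads[best] += counts[cell]
    return assignment, loads

def edge_cut(graph, assignment):
    """Number of adjacent cell pairs that end up in different partitions."""
    crossing = sum(1 for cell, nbs in graph.items()
                   for nb in nbs if assignment[cell] != assignment[nb])
    return crossing // 2  # every undirected edge was seen from both ends
```

Here max(loads) plays the role of the slowest node's workload, while edge_cut counts neighbouring cells that are split across partitions. A partitioner that optimizes both at once is what the graph formulation is meant to enable; the greedy rule above favours balance and may cut more edges than necessary.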



Author information

Corresponding author

Correspondence to Antonio Cavalcante Araujo Neto.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Araujo Neto, A.C., Coelho da Silva, T.L., de Farias, V.A.E., Macêdo, J.A.F., de Castro Machado, J. (2015). G2P: A Partitioning Approach for Processing DBSCAN with MapReduce. In: Gensel, J., Tomko, M. (eds) Web and Wireless Geographical Information Systems. W2GIS 2015. Lecture Notes in Computer Science, vol 9080. Springer, Cham. https://doi.org/10.1007/978-3-319-18251-3_12

  • DOI: https://doi.org/10.1007/978-3-319-18251-3_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18250-6

  • Online ISBN: 978-3-319-18251-3

  • eBook Packages: Computer Science, Computer Science (R0)
