DOI: 10.1145/2670979.2670997
tutorial
Open access

Connected Components in MapReduce and Beyond

Published: 03 November 2014

Abstract

Computing the connected components of a graph lies at the core of many data mining algorithms and is a fundamental subroutine in graph clustering. The problem is well studied, yet many algorithms with good theoretical guarantees perform poorly in practice, especially on graphs with hundreds of billions of edges. In this paper, we design improved algorithms based on the traditional MapReduce architecture for large-scale data analysis. We also explore the effect of augmenting MapReduce with a distributed hash table (DHT) service. We show that these algorithms have provable theoretical guarantees and easily outperform previously studied algorithms, sometimes by more than an order of magnitude. In particular, our iterative MapReduce algorithms run 3 to 15 times faster than the best previously studied algorithms, and the MapReduce implementation using a DHT is 10 to 30 times faster than the best previously studied algorithms. These are the fastest algorithms that easily scale to graphs with hundreds of billions of edges.
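
For readers unfamiliar with the problem setting, the following is a minimal sketch of a generic baseline: iterative minimum-label propagation, written as simulated map and reduce rounds in plain Python. It is not the algorithm proposed in the paper, and the function names (map_phase, reduce_phase, connected_components) are hypothetical names chosen for illustration. Each round, every node adopts the smallest label among itself and its neighbors; the loop stops when a round changes nothing. The number of rounds in this baseline grows with the graph's diameter, which is one reason round-efficient algorithms matter at scale.

```python
from collections import defaultdict


def map_phase(edges, labels):
    """Map step: emit (node, candidate_label) pairs.

    Each undirected edge sends each endpoint's current label to the
    other endpoint; every node also re-emits its own label.
    """
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]
    for node, label in labels.items():
        yield node, label


def reduce_phase(pairs):
    """Reduce step: group candidates by node and keep the minimum."""
    grouped = defaultdict(list)
    for node, label in pairs:
        grouped[node].append(label)
    return {node: min(candidates) for node, candidates in grouped.items()}


def connected_components(edges):
    """Run map/reduce rounds until no label changes.

    Returns (labels, rounds), where labels maps each node to the
    smallest node id in its component.
    """
    labels = {node: node for edge in edges for node in edge}
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(edges, labels))
        rounds += 1
        if new_labels == labels:
            return labels, rounds
        labels = new_labels


if __name__ == "__main__":
    labels, rounds = connected_components([(1, 2), (2, 3), (4, 5)])
    print(labels, rounds)
```

In a real MapReduce job the grouping in reduce_phase would be done by the framework's shuffle, and the convergence check would typically use a job counter; both are simplified here to keep the sketch self-contained.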


    Published In

    SOCC '14: Proceedings of the ACM Symposium on Cloud Computing
    November 2014
    383 pages
    ISBN:9781450332521
    DOI:10.1145/2670979
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Connected Components
    2. MapReduce Algorithms

    Qualifiers

    • Tutorial
    • Research
    • Refereed limited

    Conference

    SOCC '14: ACM Symposium on Cloud Computing
    November 3 - 5, 2014
    Seattle, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 1,440
    • Downloads (last 6 weeks): 190
    Reflects downloads up to 01 Nov 2024.

    Citations

    Cited By

    • (2024) BTS: Load-Balanced Distributed Union-Find for Finding Connected Components with Balanced Tree Structures. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 1090-1102. DOI: 10.1109/ICDE60146.2024.00089. Online publication date: 13-May-2024.
    • (2024) A MapReduce-Based Approach for Fast Connected Components Detection from Large-Scale Networks. Big Data. DOI: 10.1089/big.2022.0264. Online publication date: 29-Jan-2024.
    • (2023) FENCE: Fairplay Ensuring Network Chain Entity for Real-Time Multiple ID Detection at Scale In Fantasy Sports. Proceedings of the Third International Conference on AI-ML Systems, pp. 1-7. DOI: 10.1145/3639856.3639882. Online publication date: 25-Oct-2023.
    • (2023) Node-Differentially Private Estimation of the Number of Connected Components. Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 183-194. DOI: 10.1145/3584372.3588671. Online publication date: 18-Jun-2023.
    • (2023) MapReduce for Graphs Processing: New Big Data Algorithm for 2-Edge Connected Components and Future Ideas. IEEE Access, 11, pp. 54986-55001. DOI: 10.1109/ACCESS.2023.3281266. Online publication date: 2023.
    • (2022) UniCon: A unified star-operation to efficiently find connected components on a cluster of commodity hardware. PLOS ONE, 17:11, e0277527. DOI: 10.1371/journal.pone.0277527. Online publication date: 30-Nov-2022.
    • (2022) Equivalence classes and conditional hardness in massively parallel computations. Distributed Computing. DOI: 10.1007/s00446-021-00418-2. Online publication date: 20-Jan-2022.
    • (2021) ConnectIt. Proceedings of the VLDB Endowment, 14:4, pp. 653-667. DOI: 10.14778/3436905.3436923. Online publication date: 22-Feb-2021.
    • (2021) Massively Parallel Computation via Remote Memory Access. ACM Transactions on Parallel Computing, 8:3, pp. 1-25. DOI: 10.1145/3470631. Online publication date: 20-Sep-2021.
    • (2021) DIGDUG: Scalable Separable Dense Graph Pruning and Join Operations in MapReduce. IEEE Transactions on Big Data, 7:6, pp. 930-951. DOI: 10.1109/TBDATA.2020.2983650. Online publication date: 1-Dec-2021.
