research-article

Distributed GraphLab: a framework for machine learning and data mining in the cloud

Published: 01 April 2012

Abstract

While high-level data-parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction, which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.
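To make "asynchronous, dynamic, graph-parallel computation" concrete, here is a minimal single-threaded sketch, not GraphLab's actual API. It assumes a PageRank-style per-vertex update and a simple work queue; the names `dynamic_pagerank`, `out_edges`, `damping`, and `tol` are hypothetical. Each update reads only neighboring vertices, and a vertex's successors are rescheduled only when its value changed by more than the tolerance.

```python
from collections import deque


def dynamic_pagerank(out_edges, damping=0.85, tol=1e-6):
    """Toy scheduler: run a per-vertex update and reschedule
    dependents only when the vertex value actually changed."""
    # Collect every vertex that appears as a source or a target.
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)

    # Build the reverse adjacency so an update can read its in-neighbors.
    in_edges = {v: [] for v in vertices}
    for src, targets in out_edges.items():
        for dst in targets:
            in_edges[dst].append(src)
    out_degree = {v: len(out_edges.get(v, [])) for v in vertices}

    rank = {v: 1.0 for v in vertices}
    queue = deque(sorted(vertices))   # initially every vertex is scheduled
    queued = set(vertices)
    updates = 0

    while queue:
        v = queue.popleft()
        queued.discard(v)
        # The update touches only v and its neighbors (its "scope").
        new_rank = (1.0 - damping) + damping * sum(
            rank[u] / out_degree[u] for u in in_edges[v]
        )
        change = abs(new_rank - rank[v])
        rank[v] = new_rank
        updates += 1
        # Dynamic scheduling: only a significant change triggers more work.
        if change > tol:
            for w in out_edges.get(v, []):
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
    return rank, updates


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
    ranks, n_updates = dynamic_pagerank(graph)
    print(ranks, n_updates)
```

In this sketch, vertices whose inputs have stopped changing drop out of the queue, so work concentrates where values are still moving. A bulk-synchronous job would instead recompute every vertex on every pass.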
We develop graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show one to two orders of magnitude performance gains over Hadoop-based implementations.
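For readers unfamiliar with the Chandy-Lamport snapshot algorithm named above, the following is a minimal simulation of the classic algorithm over FIFO channels. It is not the paper's implementation, which the abstract describes as built on the GraphLab abstraction itself. The money-transfer scenario and the names `Network`, `Process`, `send`, `drain`, and `snapshot_total` are illustrative assumptions.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, balance):
        self.balance = balance
        self.recorded_balance = None  # local state captured by the snapshot
        self.recording = {}           # incoming channel -> in-flight messages
        self.closed = set()           # incoming channels whose marker arrived


class Network:
    def __init__(self, balances):
        self.procs = {name: Process(b) for name, b in balances.items()}
        names = list(balances)
        # One FIFO channel per ordered pair of distinct processes.
        self.channels = {(s, d): deque() for s in names for d in names if s != d}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, initiator):
        self._record_and_broadcast(initiator)

    def _record_and_broadcast(self, name):
        p = self.procs[name]
        p.recorded_balance = p.balance            # record local state
        for (s, d), channel in self.channels.items():
            if s == name:
                channel.append(MARKER)            # marker on every outgoing channel
            if d == name:
                p.recording[(s, d)] = []          # start recording incoming channels

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_balance is None:        # first marker seen by this process
                self._record_and_broadcast(dst)
            p.closed.add((src, dst))              # stop recording this channel
        else:
            p.balance += msg
            if p.recorded_balance is not None and (src, dst) not in p.closed:
                p.recording[(src, dst)].append(msg)  # message was in flight

    def drain(self):
        # Deliver messages round-robin until every channel is empty.
        while any(self.channels.values()):
            for key in list(self.channels):
                if self.channels[key]:
                    self.deliver(*key)

    def snapshot_total(self):
        in_flight = sum(sum(msgs) for p in self.procs.values()
                        for msgs in p.recording.values())
        return sum(p.recorded_balance for p in self.procs.values()) + in_flight


if __name__ == "__main__":
    net = Network({"A": 100, "B": 50, "C": 25})
    net.send("A", "B", 10)
    net.start_snapshot("A")
    net.send("B", "A", 5)   # sent while the snapshot is in progress
    net.drain()
    print(net.snapshot_total())
```

The point of the sketch is that the recorded local states plus the recorded in-flight messages form a consistent global state. Here the snapshot total matches the money in the system even though transfers continue while the snapshot is being taken.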


        Published In

        Proceedings of the VLDB Endowment, Volume 5, Issue 8
        April 2012
        96 pages

        Publisher

        VLDB Endowment

        Publication History

        Published: 01 April 2012
        Published in PVLDB Volume 5, Issue 8

        Qualifiers

        • Research-article
