research-article

Distributed GraphLab: a framework for machine learning and data mining in the cloud

Authors: Joseph Gonzalez, Carlos Guestrin, Joseph M. Hellerstein

Proceedings of the VLDB Endowment, Volume 5, Issue 8

Pages 716 - 727

https://doi.org/10.14778/2212351.2212354

Published: 01 April 2012

Abstract

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.

We develop graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

Published In

Proceedings of the VLDB Endowment, Volume 5, Issue 8

April 2012

96 pages

ISSN: 2150-8097

Publisher

VLDB Endowment