research-article

One trillion edges: graph processing at Facebook-scale

Editors: Chen Li, Volker Markl

Authors: Dionysios Logothetis, Sambavi Muthukrishnan

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Pages 1804 - 1815

https://doi.org/10.14778/2824032.2824077

Published: 01 August 2015

Abstract

Analyzing large graphs provides valuable insights for social networking and web companies in content ranking and recommendations. While numerous graph processing systems have been developed and evaluated on available benchmark graphs of up to 6.6B edges, they often face significant difficulties in scaling to much larger graphs. Industry graphs can be two orders of magnitude larger, at hundreds of billions or up to one trillion edges. In addition to scalability challenges, real-world applications often require much more complex graph processing workflows than previously evaluated. In this paper, we describe the usability, performance, and scalability improvements we made to Apache Giraph, an open-source graph processing system, in order to use it on Facebook-scale graphs of up to one trillion edges. We also describe several key extensions to the original Pregel model that make it possible to develop a broader range of production graph applications and workflows as well as improve code reuse. Finally, we report on real-world operations as well as performance characteristics of several large-scale production applications.
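For readers unfamiliar with the Pregel model that the abstract refers to, the following is a minimal single-process sketch of its vertex-centric, superstep-based style of computation, using connected components by minimum-label propagation as the example. It is an illustration only, not Giraph's API and not the authors' implementation: the function name `pregel_connected_components`, the in-memory adjacency layout, and the choice of algorithm are all assumptions made for this sketch.

```python
from collections import defaultdict


def pregel_connected_components(edges):
    """Toy simulation of Pregel-style supersteps (hypothetical helper).

    Each vertex holds a label, initially its own id. In every superstep a
    vertex reads the messages sent to it in the previous superstep, keeps the
    smallest label it has seen, and, if its label changed, sends the new label
    to its neighbors. The run ends when no messages are in flight.
    """
    # Build an undirected adjacency list (assumed in-memory layout).
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex announces its own label to its neighbors.
    inbox = defaultdict(list)
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    supersteps = 1
    while inbox:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < labels[v]:
                labels[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
            # Otherwise the vertex stays inactive until a new message arrives.
        inbox = outbox  # stands in for the barrier between supersteps
        supersteps += 1

    return labels, supersteps


if __name__ == "__main__":
    labels, steps = pregel_connected_components([(1, 2), (2, 3), (4, 5)])
    print(labels, steps)
```

In a distributed system the vertices would be partitioned across workers and the inbox/outbox swap would be a global synchronization barrier; this sketch keeps everything in one process purely to show the control flow.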



Index Terms

One trillion edges: graph processing at Facebook-scale
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks

Index terms have been assigned to the content through auto-classification.



Published In

Proceedings of the VLDB Endowment Volume 8, Issue 12

Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

August 2015

728 pages

ISSN:2150-8097

Editors: Chen Li (University of California, Irvine), Volker Markl (TU Berlin)

Publisher

VLDB Endowment

