One trillion edges: graph processing at Facebook-scale

Published: 01 August 2015

Abstract

Analyzing large graphs provides valuable insights for social networking and web companies in content ranking and recommendations. While numerous graph processing systems have been developed and evaluated on available benchmark graphs of up to 6.6B edges, they often face significant difficulties in scaling to much larger graphs. Industry graphs can be two orders of magnitude larger: hundreds of billions or up to one trillion edges. In addition to scalability challenges, real-world applications often require much more complex graph processing workflows than previously evaluated. In this paper, we describe the usability, performance, and scalability improvements we made to Apache Giraph, an open-source graph processing system, in order to use it on Facebook-scale graphs of up to one trillion edges. We also describe several key extensions to the original Pregel model that make it possible to develop a broader range of production graph applications and workflows and that improve code reuse. Finally, we report on real-world operations as well as performance characteristics of several large-scale production applications.

Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN: 2150-8097
Editors: Chen Li, Volker Markl

Publisher

VLDB Endowment
