research-article

Experimental analysis of distributed graph systems

Published: 01 June 2018
    Abstract

    This paper evaluates eight parallel graph processing systems: Hadoop, HaLoop, Vertica, Giraph, GraphLab (PowerGraph), Blogel, Flink Gelly, and GraphX (Spark). The evaluation covers four very large datasets (Twitter, World Road Network, UK 2007-05, and ClueWeb) and four workloads: PageRank, weakly connected components (WCC), single-source shortest paths (SSSP), and K-hop. The main objective is an independent scale-out study that experimentally analyzes the performance, usability, and scalability of these systems on up to 128 machines. In addition to performance results, we discuss our experiences in using these systems and suggest system tuning heuristics that lead to better performance.
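
    For readers unfamiliar with the four workloads, the following is a minimal single-machine sketch of what each one computes. It is not taken from the paper and does not reflect how any of the evaluated systems implement these workloads. The function names, the adjacency-dict graph format, the damping factor, the fixed iteration count, and the unit edge weights are all assumptions made for illustration.

```python
from collections import deque
import heapq


def pagerank(adj, damping=0.85, iterations=20):
    """Power-iteration PageRank over {vertex: [out-neighbors]} (assumed format)."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in adj}
        for v, outs in adj.items():
            # A dangling vertex (no out-edges) spreads its rank over all vertices.
            targets = outs if outs else list(adj)
            share = damping * rank[v] / len(targets)
            for u in targets:
                nxt[u] += share
        rank = nxt
    return rank


def wcc(adj):
    """Weakly connected components via min-label propagation (direction ignored)."""
    undirected = {v: set() for v in adj}
    for v, outs in adj.items():
        for u in outs:
            undirected[v].add(u)
            undirected[u].add(v)
    label = {v: v for v in adj}
    changed = True
    while changed:
        changed = False
        for v in adj:
            best = min([label[v]] + [label[u] for u in undirected[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label


def sssp(adj, source, weight=lambda v, u: 1):
    """Dijkstra shortest-path distances from source; unit weights by default."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u in adj[v]:
            nd = d + weight(v, u)
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist


def k_hop(adj, source, k):
    """Set of vertices reachable from source in at most k hops (depth-limited BFS)."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if depth[v] == k:
            continue  # do not expand past the hop limit
        for u in adj[v]:
            if u not in depth:
                depth[u] = depth[v] + 1
                queue.append(u)
    return set(depth)


# Toy graph (hypothetical): a directed 3-cycle plus a separate edge 3 -> 4.
toy = {0: [1], 1: [2], 2: [0], 3: [4], 4: []}
components = wcc(toy)
distances = sssp(toy, 0)
neighbors = k_hop(toy, 0, 1)
```

    Every vertex must appear as a key in the adjacency dict, including vertices with no outgoing edges. WCC and PageRank touch every vertex, whereas SSSP and K-hop start from a single source and may visit only part of the graph.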

    Published In

    Proceedings of the VLDB Endowment, Volume 11, Issue 10
    June 2018
    248 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment

