research-article

Pregelix: Big(ger) graph analytics on a dataflow engine

Published: 01 October 2014
  • Abstract

    There is a growing need for distributed graph processing systems that can scale gracefully to very large graph datasets. Unfortunately, this challenge has not been easily met because of the intense memory pressure imposed by the process-centric, message-passing designs that many graph processing systems follow. Pregelix is a new open-source distributed graph processing system based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open-source systems (e.g., we have seen up to 15X speedup compared to Apache Giraph and up to 35X speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics.



      Published In

      Proceedings of the VLDB Endowment  Volume 8, Issue 2
      October 2014
      84 pages

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 October 2014
      Published in PVLDB Volume 8, Issue 2

      Qualifiers

      • Research-article
