Management and Analysis of Big Graph Data: Current Systems and Open Challenges

  • Chapter
Handbook of Big Data Technologies

Abstract

Many big data applications in business and science require the management and analysis of huge amounts of graph data. Suitable systems to manage and analyze such graph data should meet a number of challenging requirements, including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining capabilities, ease of use, and high performance and scalability. In this chapter, we survey current system approaches for the management and analysis of “big graph data”. We discuss graph database systems, distributed graph processing systems such as Google Pregel and its variations, and graph dataflow approaches based on Apache Spark and Flink. We further outline a recent research framework called Gradoop that is built on the so-called Extended Property Graph Data Model, with dedicated support for analyzing not only single graphs but also collections of graphs. Finally, we discuss current and future research challenges.

Notes

  1. http://lod-cloud.net/.

  2. http://tinkerpop.apache.org/.

  3. http://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API.

  4. http://db-engines.com/en/ranking/graph+dbms.

  5. https://www.w3.org/TR/rdf-schema/#ch_reificationvocab.

  6. http://www.opencypher.org/.

  7. We use vertex compute function and vertex function interchangeably throughout this section.

  8. At its core, Flink is a distributed streaming system that provides both streaming and batch APIs. We focus on the batch API, as Gelly is currently implemented on top of it.

  9. Flink supports further systems as data source and sink, e.g., relational and NoSQL databases or queuing systems.

  10. When implemented using a synchronous graph-processing system.

  11. The coGroup transformation groups each input dataset on one or more fields and then joins the groups.

  12. GSA is a variant of the GAS abstraction introduced by PowerGraph [41] and discussed in Sect. 3.

  13. The Neighbor class allows access to the incident edge value and the adjacent vertex value.

  14. An operator fulfills the closure property if the execution of that operator on members of an input domain results in members of the same domain.

  15. http://www.gradoop.com.

  16. http://hbase.apache.org.

  17. The betweenness centrality of a vertex is defined as the number of shortest paths in a network passing through the vertex. A high value thus indicates that a vertex is centrally located and plays an important role in the network.

  18. www.mpi-inf.mpg.de/yago-naga/yago/.

  19. http://dbpedia.org/.

  20. www.wikidata.org.

  21. http://neo4j.com/graph-visualization-neo4j/.
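
As a rough illustration of the coGroup transformation described in note 11, the following plain-Python sketch groups two inputs by key and pairs up the groups that share a key. It is not Flink code: the function name co_group, the key extractors, and the sample vertex and edge tuples are hypothetical and chosen only for illustration. The sketch assumes that a key present in only one input is paired with an empty group.

```python
from collections import defaultdict


def co_group(left, right, left_key, right_key):
    """Group both inputs by key, then pair the groups that share a key."""
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for record in left:
        left_groups[left_key(record)].append(record)
    for record in right:
        right_groups[right_key(record)].append(record)

    # Assumption: a key that occurs in only one input gets an empty group
    # for the other side instead of being dropped.
    result = {}
    for key in sorted(set(left_groups) | set(right_groups)):
        result[key] = (left_groups.get(key, []), right_groups.get(key, []))
    return result


# Hypothetical sample data: (vertex id, name) and (source id, target id).
vertices = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
edges = [(1, 2), (1, 3), (2, 3)]

# Pair every vertex with its outgoing edges.
grouped = co_group(vertices, edges, lambda v: v[0], lambda e: e[0])
for vertex_id, (vertex_group, edge_group) in grouped.items():
    print(vertex_id, vertex_group, edge_group)
```

Unlike a plain join, each key yields exactly one pair of groups, so a user function can see all matching records from both sides at once.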

References

  1. C. Aggarwal, K. Subbian, Evolutionary network analysis: a survey. ACM Comput. Surv. (CSUR) 47(1), 10 (2014)

  2. G.A. Agha, Actors: a model of concurrent computation in distributed systems. Technical report, DTIC Document (1985)

  3. Akka. http://www.akka.io. Accessed 10 Mar 2016

  4. A. Alexandrov et al., The stratosphere platform for big data analytics. VLDB J. 23(6) (2014)

  5. AllegroGraph. http://franz.com/agraph/allegrograph/. Accessed 10 Mar 2016

  6. R. Angles, A comparison of current graph database models, in Proceedings of ICDEW (2012)

  7. R. Angles, C. Gutierrez, Survey of graph database models. ACM Comput. Surv. (CSUR) 40(1) (2008)

  8. R. Angles et al., The linked data benchmark council: a graph and RDF industry benchmarking effort. Proc. SIGMOD 43(1) (2014)

  9. Apache Flink Iteration Operators. https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#iteration-operators. Accessed 09 Mar 2016

  10. Apache Giraph. http://www.giraph.apache.org. Accessed 10 Mar 2016

  11. Apache Jena - TDB. https://jena.apache.org/documentation/tdb/. Accessed 09 Mar 2016

  12. T.G. Armstrong et al., Linkbench: a database benchmark based on the facebook social graph (2013)

  13. G. Bagan et al., gMark: controlling diversity in benchmarking graph databases. CoRR abs/1511.08386 (2015)

  14. O. Batarfi et al., Large scale graph processing systems: survey and an experimental evaluation. Clust. Comput. 18(3) (2015)

  15. K. Bellare et al., Woo: a scalable and multi-tenant platform for continuous knowledge base synthesis. PVLDB 6(11) (2013)

  16. D.P. Bertsekas, J.N. Tsitsiklis, Parallel and distributed computation: numerical methods, vol. 23 (1989)

  17. Big Data Spatial and Graph User’s Guide and Reference. http://docs.oracle.com/cd/E69290_01/doc.44/e67958/toc.htm. Accessed 16 Mar 2016

  18. H. Bolouri, Modeling genomic regulatory networks with big data. Trends Genet. 30(5) (2014)

  19. D. Brickley, L. Miller, Foaf vocabulary specification 0.98. Namespace document 9 (2012)

  20. A. Buluç et al., Recent advances in graph partitioning. CoRR (2013)

  21. M. Canim, Y.C. Chang, System G data store: big, rich graph data analytics in the cloud, in IEEE Cloud Engineering (IC2E) (March 2013)

  22. G. Carothers, RDF 1.1 N-Quads: a line-based syntax for RDF datasets. W3C Recommendation (2014)

  23. R. Cattell, Scalable SQL and NoSQL data stores. Proc. SIGMOD 39(4) (2011)

  24. C. Chen et al., Graph OLAP: towards online analytical processing on graphs, in IEEE Data Mining (ICDM) (2008)

  25. R. Cheng et al., Kineograph: taking the pulse of a fast-changing and connected world, in Proceedings of EuroSys (2012)

  26. Cypher Query Language. http://neo4j.com/docs/stable/cypher-query-lang.html. Accessed 16 Mar 2016

  27. S. Das et al., A Tale of two graphs: property graphs as RDF in Oracle, in EDBT (2014)

  28. R. Diestel, Graph theory, Graduate Texts in Mathematics, vol. 173, 4th edn. (2012)

  29. Y. Ding, Scientific collaboration and endorsement: network analysis of coauthorship and citation networks. J. Inform. 5(1) (2011)

  30. X. Dong et al., Knowledge Vault: a web-scale approach to probabilistic knowledge fusion, in Proceedings of SIGKDD (2014)

  31. B. Elser, A. Montresor, An evaluation study of bigdata frameworks for graph processing, in IEEE Big Data (2013)

  32. O. Erling, I. Mikhailov, RDF support in the Virtuoso DBMS, in Networked Knowledge-Networked Media (2009)

  33. O. Erling et al., The ldbc social network benchmark: interactive workload, in Proceedings of SIGMOD (2015)

  34. S. Ewen et al., Spinning fast iterative data flows. PVLDB 5(11) (2012)

  35. S. Ewen et al., Iterative parallel data processing with stratosphere: an inside look, in Proceedings of SIGMOD (2013)

  36. S. Fortunato, Community detection in graphs. Phys. Rep. 486(3–5) (2010)

  37. B. Gallagher, Matching structure and semantics: a survey on graph-based pattern matching. AAAI FS 6 (2006)

  38. J. Gao et al., Glog: a high level graph analysis system using mapreduce, in Proceedings of ICDE (2014)

  39. Gelly: Flink Graph API. https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html. Accessed 15 Mar 2016

  40. A. Ghrab et al., A framework for building OLAP cubes on graphs, in Advances in Databases and Information Systems (2015)

  41. J.E. Gonzalez et al., Powergraph: distributed graph-parallel computation on natural graphs, in Proceedings of OSDI (2012)

  42. J.E. Gonzalez et al., GraphX: graph processing in a distributed dataflow framework, in Proceedings of OSDI (2014)

  43. GraphDB: At Last, the Meaningful Database. http://ontotext.com/documents/reports/PW_Ontotext.pdf. Whitepaper July 2014

  44. Y. Guo et al., How well do graph-processing platforms perform? An empirical performance evaluation and analysis, in Proceedings of Parallel and Distributed Processing Symposium (2014)

  45. D. Haas et al., Wisteria: nurturing scalable data cleaning infrastructure. PVLDB 8(12) (2015)

  46. T. Haerder, A. Reuter, Principles of transaction-oriented database recovery. ACM Comput. Surv. 15(4) (1983)

  47. M. Han et al., An experimental comparison of pregel-like graph processing systems. PVLDB 7(12) (2014)

  48. S. Harris, A. Seaborne, E. Prudhommeaux, SPARQL 1.1 query language. W3C Recommendation 21 (2013)

  49. O. Hartig, B. Thompson, Foundations of an alternative approach to reification in RDF. Technical Report. arXiv:1406.3399 (2014)

  50. T. Hayashi, T. Akiba, Y. Yoshida, Fully dynamic betweenness centrality maintenance on massive networks. PVLDB 9(2) (2015)

  51. J. Huang, D.J. Abadi, LEOPARD: lightweight edge-oriented partitioning and replication for dynamic graphs. PVLDB 9(7) (2016)

  52. InfiniteGraph: The Distributed Graph Database. http://www.objectivity.com/wp-content/uploads/Objectivity_WP_IG_Distr_Benchmark.pdf. Whitepaper 2012

  53. B. Iordanov, HyperGraphDB: a generalized graph database, in Web-Age Information Management (2010)

  54. N. Jain, G. Liao, T.L. Willke, Graphbuilder: scalable graph ETL framework, in International Workshop on Graph Data Management Experiences and Systems (2013)

  55. C. Jiang et al., A survey of Frequent Subgraph Mining algorithms. Knowl. Eng. Rev. 28(1) (2013)

  56. M. Junghanns et al., GRADOOP: Scalable Graph Data Management and Analytics with Hadoop. Technical Report. arXiv:1506.00548 (2015)

  57. M. Junghanns et al., Analyzing extended property graphs with apache flink, in Proceedings of SIGMOD Workshop on Network Data Analytics (2016)

  58. Z. Kaoudi, I. Manolescu, RDF in the clouds: a survey. VLDB J. 24(1) (2015)

  59. G. Karypis, V. Kumar, Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1) (1998)

  60. Key Features - ArangoDB. https://www.arangodb.com/key-features/. Accessed 10 Mar 2016

  61. Z. Khayyat et al., Mizan: a system for dynamic load balancing in large-scale graph processing, in Proceedings EuroSys (2013)

  62. Z. Khayyat et al., Bigdansing: a system for big data cleansing, in Proceedings SIGMOD (2015)

  63. G. Klyne, J.J. Carroll, Resource description framework (RDF): concepts and abstract syntax (2006)

  64. L. Kolb, A. Thor, E. Rahm, Dedoop: efficient deduplication with Hadoop. PVLDB 5(12) (2012)

  65. L. Kolb, Z. Sehili, E. Rahm, Iterative computation of connected graph components with MapReduce. Datenbank-Spektrum 14(2) (2014)

  66. D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques (2009)

  67. A. Kyrola, G. Blelloch, C. Guestrin, GraphChi: large-scale graph computation on just a PC, in Proceedings OSDI (2012)

  68. J. Lin, M. Schatz, Design patterns for efficient graph algorithms in MapReduce, in Proceedings of 8th Workshop on Mining and Learning with Graphs (2010)

  69. Y. Low et al., Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB 5(8) (2012)

  70. Y. Lu, J. Cheng, D. Yan, H. Wu, Large-scale distributed graph computing systems: an experimental evaluation. PVLDB 8(3) (2014)

  71. G. Malewicz et al., Pregel: a system for large-scale graph processing, in Proceedings of SIGMOD (2010)

  72. MarkLogic Semantics. http://www.marklogic.com/resources/marklogic-semantics-datasheet/. Datasheet March 2016

  73. N. Martinez-Bazan, S. Gomez-Villamor, F. Escale-Claveras, DEX: a high-performance graph database management system, in Proceedings of ICDEW (2011)

  74. R. McColl et al., A performance evaluation of open source graph databases, in Proceedings of PPAAW (2014)

  75. R.R. McCune, T. Weninger, G. Madey, Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. (CSUR) 48(2) (2015)

  76. F. McSherry et al., Composable incremental and iterative data-parallel computation with naiad. Technical Report MSR-TR-2012-105 (October 2012)

  77. J.J. Miller, Graph database applications and concepts with Neo4j, in Proceedings of Southern Association for Information Systems Conference, vol. 2324 (2013)

  78. J. Mondal, A. Deshpande, Managing large dynamic graphs efficiently, in Proceedings of SIGMOD (2012)

  79. D.G. Murray et al., Naiad: a timely dataflow system, in Proceedings of 24th ACM Symposium on Operating Systems Principles. SOSP ’13 (2013)

  80. R. Nehme, N. Bruno, Automated partitioning design in parallel database systems, in Proceedings of SIGMOD (2011)

  81. M. Nickel, K. Murphy, V. Tresp, E. Gabrilovich, A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1) (2016)

  82. Oracle Spatial and Graph: Advanced Data Management. http://www.oracle.com/technetwork/database/options/spatialandgraph/spatial-and-graph-wp-12c-1896143.pdf. Whitepaper September 2014

  83. A. Petermann et al., BIIIG: enabling business intelligence with integrated instance graphs, in Proceedings of ICDEW (2014)

  84. A. Petermann et al., FoodBroker-generating synthetic datasets for graph-based business analytics, in Big Data Benchmarking (2014)

  85. A. Petermann et al., Graph-based data integration and business intelligence with BIIIG. PVLDB 7(13) (2014)

  86. A. Poulovassilis, M. Levene, A nested-graph model for the representation and manipulation of complex objects. ACM Trans. Inform. Syst. (TOIS) 12(1) (1994)

  87. quasar. http://www.paralleluniverse.co/quasar. Accessed 10 Mar 2016

  88. U.N. Raghavan et al., Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007)

  89. F. Rahimian et al., Distributed vertex-cut partitioning, in Distributed Applications and Interoperable Systems (2014)

  90. E. Rahm, The case for holistic data integration, in Advances in Databases and Information Systems (2016)

  91. J. Rao et al., Automating physical database design in a parallel database, in Proceedings of SIGMOD (2002)

  92. M.A. Rodriguez, The gremlin graph traversal machine and language (invited talk), in Proceedings of 15th Symposium on Database Programming Languages (2015)

  93. M.A. Rodriguez, P. Neubauer, Constructions from dots and lines. Bull. Am. Soc. Inform. Sci. Technol. 36(6) (2010)

  94. A. Roy et al., Chaos: scale-out graph processing from secondary storage, in Proceedings of 25th Symposium on Operating Systems Principles (2015)

  95. M. Rudolf et al., The graph story of the SAP HANA database, in Proceedings of BTW (2013)

  96. S. Sakr, A. Liu, A.G. Fayoumi, The family of mapreduce and large-scale data processing systems. ACM Comput. Surv. (CSUR) 46(1) (2013)

  97. S. Salihoglu, J. Widom, GPS: a graph processing system, in Proceedings of 25th International Conference on Scientific and Statistical Database Management. SSDBM (2013)

  98. N. Satish et al., Navigating the maze of graph analytics frameworks using massive graph datasets, in Proceedings of SIGMOD (2014)

  99. K. Shim, MapReduce algorithms for big data analysis. PVLDB 5(12) (2012)

  100. I. Stanton, G. Kliot, Streaming graph partitioning for large distributed graphs, in Proceedings of SIGKDD

  101. Stardog 4 - The Manual. http://docs.stardog.com/. Accessed 10 Mar 2016

  102. P. Stutz, A. Bernstein, W. Cohen, Signal/collect: graph algorithms for the (semantic) web, in ISWC (2010)

  103. W. Sun et al., SQLGraph: an efficient relational-based property graph store, in Proceedings of SIGMOD (2015)

  104. C. Teixeira et al., Arabesque: a system for distributed graph mining, in Proceedings of 25th Symposium on Operating Systems Principles (2015)

  105. The bigdata RDF Database. https://www.blazegraph.com/whitepapers/bigdata_architecture_whitepaper.pdf. Whitepaper May 2013

  106. Y. Tian, R.A. Hankins, J.M. Patel, Efficient aggregation for graph summarization, in Proceedings of SIGMOD (2008)

  107. Y. Tian et al., From “Think Like a Vertex” to “Think Like a Graph”. PVLDB 7(3) (2013)

  108. TITAN: Distributed Graph Database. http://thinkaurelius.github.io/titan/. Accessed 10 Mar 2016

  109. N.B. Turk-Browne, Functional interactions as big data in the human brain. Science 342(6158) (2013)

  110. L.G. Valiant, A bridging model for parallel computation. CACM 33(8) (1990)

  111. X.H. Wang et al., Ontology based context modeling and reasoning using owl, in Pervasive Computing and Communications Workshops (2004)

  112. Z. Wang et al., Pagrol: parallel graph olap over large-scale attributed graphs, in Proceedings of ICDE (2014)

  113. Why OrientDB? http://orientdb.com/why-orientdb/. Accessed 10 Mar 2016

  114. Y. Xia et al., Graph analytics and storage, in IEEE Big Data (2014)

  115. R.S. Xin et al., GraphX: a resilient distributed graph system on spark, in First International Workshop on Graph Data Management Experiences and Systems. GRADES ’13 (2013)

  116. R.S. Xin et al., GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. Technical Report. arXiv:1402.2394 (2014)

  117. P. Yuan et al., Triplebit: a fast and compact system for large scale rdf data. PVLDB 6(7) (2013)

  118. M. Zaharia et al., Spark: cluster computing with working sets, in Proceedings of 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10 (2010)

  119. N. Zhang, Y. Tian, J.M. Patel, Discovery-driven graph summarization, in Proceedings of ICDE (2010)

  120. P. Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of SIGMOD (2011)

  121. Y. Zhao et al., Evaluation and analysis of distributed graph-parallel processing frameworks. J. Cyber Secur. Mobil. 3(3) (2014)

Acknowledgements

This work is partially funded by the German Federal Ministry of Education and Research under project ScaDS Dresden/Leipzig (BMBF 01IS14014B).

Author information

Correspondence to Martin Junghanns.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Junghanns, M., Petermann, A., Neumann, M., Rahm, E. (2017). Management and Analysis of Big Graph Data: Current Systems and Open Challenges. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_14

  • DOI: https://doi.org/10.1007/978-3-319-49340-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49339-8

  • Online ISBN: 978-3-319-49340-4

  • eBook Packages: Computer Science, Computer Science (R0)
