Abstract
Many big data applications in business and science require the management and analysis of huge amounts of graph data. Suitable systems to manage and analyze such graph data should meet a number of challenging requirements, including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining capabilities, ease of use, as well as high performance and scalability. In this chapter, we survey current system approaches for the management and analysis of “big graph data”. We discuss graph database systems, distributed graph processing systems such as Google Pregel and its variations, and graph dataflow approaches based on Apache Spark and Flink. We further outline a recent research framework called Gradoop that is built on the so-called Extended Property Graph Data Model, with dedicated support for analyzing not only single graphs but also collections of graphs. Finally, we discuss current and future research challenges.
Notes
- 7. We use "vertex compute function" and "vertex function" interchangeably throughout this section.
- 8. At its core, Flink is a distributed streaming system and provides streaming as well as batch APIs. We focus on the batch API, as Gelly is currently implemented on top of that.
- 9. Flink supports further systems as data sources and sinks, e.g., relational and NoSQL databases or queuing systems.
- 10. When implemented using a synchronous graph-processing system.
- 11. The coGroup transformation groups each input dataset on one or more fields and then joins the groups.
- 13. The Neighbor class allows access to the incident edge value and the adjacent vertex value.
- 14. An operator fulfills the closure property if the execution of that operator on members of an input domain results in members of the same domain.
- 17. The betweenness centrality of a vertex is defined as the number of shortest paths in a network passing through the vertex. A high value thus indicates that a vertex is centrally located so that it plays an important role in a network.
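The betweenness definition in note 17 can be illustrated with a minimal sketch. It assumes an undirected, unweighted graph given as an adjacency dictionary and counts, for one vertex, the shortest paths between all other vertex pairs that pass through it. The function names `shortest_path_counts` and `paths_through` are hypothetical and not taken from any of the surveyed systems; note also that the literature often normalizes each pair's count by that pair's total number of shortest paths, which this sketch does not do.

```python
from collections import deque
from itertools import combinations


def shortest_path_counts(graph, source):
    """BFS from source: distance and number of shortest paths per reachable vertex."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                # First time w is reached: it lies one level below u.
                dist[w] = dist[u] + 1
                count[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                # Every shortest path to u extends to a shortest path to w.
                count[w] += count[u]
    return dist, count


def paths_through(graph, v):
    """Number of shortest paths between unordered pairs {s, t}, s, t != v, that pass through v."""
    bfs = {u: shortest_path_counts(graph, u) for u in graph}
    dist_v, count_v = bfs[v]
    total = 0
    for s, t in combinations([u for u in graph if u != v], 2):
        dist_s, count_s = bfs[s]
        if t not in dist_s or v not in dist_s or t not in dist_v:
            continue  # s, t, and v are not all mutually reachable
        if dist_s[v] + dist_v[t] == dist_s[t]:
            # v lies on a shortest s-t path: combine s->v and v->t path counts.
            total += count_s[v] * count_v[t]
    return total


# A star: every leaf-to-leaf shortest path runs through the hub.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
# A 4-cycle: opposite vertices are connected by two shortest paths.
cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}

print(paths_through(star, "hub"), paths_through(star, "a"), paths_through(cycle, "b"))
```

Running one breadth-first search per vertex keeps the sketch short but is only practical for small graphs; it is meant to make the definition concrete, not to suggest how a distributed system would compute the measure.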
Acknowledgements
This work is partially funded by the German Federal Ministry of Education and Research under project ScaDS Dresden/Leipzig (BMBF 01IS14014B).
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Junghanns, M., Petermann, A., Neumann, M., Rahm, E. (2017). Management and Analysis of Big Graph Data: Current Systems and Open Challenges. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_14
DOI: https://doi.org/10.1007/978-3-319-49340-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science, Computer Science (R0)