Graph Databases
The NoSQL movement
• Since the 80s, the dominant back end of business systems has been
a relational database
• In the past decade, we have been faced with data that is bigger in
volume, changes more rapidly, and is more structurally varied (in a
word, Big Data) than traditional RDBMS deployments can deal with.
Limits of relational technologies for Big Data
• The schema of a relational database is static and has to be understood
from the beginning of the database design => Big data may change at
a high rate over time, and so may its structure.
• Query execution times increase as the size of tables and the number
of joins grow (so-called join pain) => this is not sustainable when
we require sub-second response to queries (NoSQL approaches tend
to organize the data in such a way that the join is already computed,
but this comes at the price of flexibility: joins that have not been
precomputed essentially cannot be executed).
Graph databases
• Graph databases can be used for both OLAP (since they are naturally
multidimensional structures) and OLTP.
Graph databases
• Graph databases are schemaless:
– Thus they behave well in response to the dynamics of big data: you
can accumulate data incrementally, without the need for a
predefined, rigid schema;
– They provide flexibility in assigning different properties to
different pieces of information, at any granularity;
– They are very good at managing sparse data.
– This does not mean that intensional aspects cannot be cast into a
graph, but they are not predefined and are normally treated as data
are treated.
• Graph databases can be queried through (standardized)
languages: depending on the storage engine (see later) they can
provide very good performance because essentially they avoid
classical joins (but performance depends on the kind of queries).
Flexibility in graph databases
Incorporating dynamic information is natural and simple.
Graph Databases Embrace Relationships
• Obviously, graph databases are particularly suited to modeling situations in
which the information is somehow “natively” in the form of a graph.
• The key to the success of graph databases in these contexts is the fact that they
provide native means to represent relationships.
Graph DBs vs Relational DBs- Example
Asking “who are Bob’s friends?” (i.e., those that Bob
considers friends) is easy:
SELECT p1.Person
FROM Person p1
JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
JOIN Person p2 ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'
Graph DBs vs Relational DBs- Example
Things become more problematic when we ask, “who are
Alice’s friends-of-friends?”
[Figure: social graph with FRIEND_OF edges; the nodes shown include Zac]
• Indeed, in the SQL query we need 3 joins (each table is joined twice).
• If n is the number of persons and m is the number of pairs of friends,
this means a cost of O(n^2 m^2).
• Indexes reduce this cost, since they allow us to avoid linear search over
a column.
• Assuming that the structure of the index is a binary search tree, the cost
is O((log_2 n)^2 (log_2 m)^2).
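As a rough illustration of the three-join query discussed above, the following Python sketch runs it on an in-memory SQLite database. The Person/PersonFriend schema follows the previous example; the sample rows (who is a friend of whom) and the function name are invented for illustration only.

```python
import sqlite3

# In-memory database with the Person / PersonFriend schema of the slides.
# The rows below are made-up sample data, not taken from the course material.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
    CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
    INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zac');
    INSERT INTO PersonFriend VALUES (1, 2), (2, 1), (2, 3);
""")

# Friends-of-friends: Person and PersonFriend each appear twice (3 joins).
FOAF_QUERY = """
    SELECT DISTINCT p2.Person
    FROM Person p1
    JOIN PersonFriend pf1 ON pf1.PersonID = p1.ID
    JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
    JOIN Person p2 ON p2.ID = pf2.FriendID
    WHERE p1.Person = ? AND p2.ID <> p1.ID
"""

def friends_of_friends(name):
    """Return the sorted names reachable in exactly two FRIEND_OF hops."""
    rows = conn.execute(FOAF_QUERY, (name,)).fetchall()
    return sorted(row[0] for row in rows)
```

Each additional hop adds one more self-join on PersonFriend, which is the growth in join count that the cost estimate above refers to.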
Graph DBs vs Relational DBs- Experiment
L is a function: V × V → PowerSet(Σ),
e.g., L={((forOffice,Term),{domain}), ((forOffice,Organization),{range}),
((_id0,AZ),{forOffice, forOrganization})… }
From [ABFRV14]
Basic Operations
Given a graph G, the following are operations over G:
AddNode(G,x): adds node x to the graph G.
DeleteNode(G,x): deletes the node x from graph G.
Adjacent(G,x,y): tests if there is an edge from x to y.
Neighbors(G,x): nodes y s.t. there is an edge from x to y.
AdjacentEdges(G,x,y): set of labels of edges from x to y.
Add(G,x,y,l): adds an edge between x and y with label l.
Delete(G,x,y,l): deletes an edge between x and y with label l.
Reach(G,x,y): tests if there is a path from x to y.
Path(G,x,y): a (shortest) path from x to y.
2-hop(G,x): set of nodes y s.t. there is a path of length 2 from x to y, or from y to x.
n-hop(G,x): set of nodes y s.t. there is a path of length n from x to y, or from y to x.
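The operations listed above can be sketched in Python as follows. This is only one possible in-memory realization (nested dictionaries); the class name, method names and storage layout are assumptions made for illustration and are not taken from [ABFRV14].

```python
from collections import deque

class LabeledGraph:
    """Directed multigraph with labeled edges, stored as nested dicts."""

    def __init__(self):
        # adj[x][y] is the set of labels on the edges from x to y
        self.adj = {}

    def add_node(self, x):
        self.adj.setdefault(x, {})

    def delete_node(self, x):
        self.adj.pop(x, None)
        for targets in self.adj.values():
            targets.pop(x, None)

    def add(self, x, y, label):
        self.add_node(x)
        self.add_node(y)
        self.adj[x].setdefault(y, set()).add(label)

    def delete(self, x, y, label):
        labels = self.adj.get(x, {}).get(y)
        if labels is not None:
            labels.discard(label)
            if not labels:
                del self.adj[x][y]

    def adjacent(self, x, y):
        return y in self.adj.get(x, {})

    def neighbors(self, x):
        return set(self.adj.get(x, {}))

    def adjacent_edges(self, x, y):
        return set(self.adj.get(x, {}).get(y, set()))

    def path(self, x, y):
        """Shortest path from x to y as a list of nodes, or None."""
        parent = {x: None}
        queue = deque([x])
        while queue:
            node = queue.popleft()
            if node == y:
                result = []
                while node is not None:
                    result.append(node)
                    node = parent[node]
                return result[::-1]
            for nxt in self.adj.get(node, {}):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None

    def reach(self, x, y):
        return self.path(x, y) is not None

    def n_hop(self, x, n):
        """Nodes joined to x by a directed walk of length n, in either direction."""
        def walk(step):
            frontier = {x}
            for _ in range(n):
                frontier = {z for node in frontier for z in step(node)}
            return frontier

        def predecessors(node):
            return {u for u, targets in self.adj.items() if node in targets}

        return walk(self.neighbors) | walk(predecessors)
```

In this layout Adjacent and AdjacentEdges are dictionary lookups, while finding predecessors (needed by n-hop) scans every node; other storage choices trade these costs differently.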
From [ABFRV14]
Implementation of Graphs
[Sakr and Pardede 2012]
Adjacency list:
V1:
V2: (V1,{L2}) (V3,{L3})
V3:
V4: (V1,{L1})
Properties:
Storage: O(|V|+|E|+|L|)
Adjacent(G,x,y): O(|V|)
Neighbors(G,x): O(|V|)
AdjacentEdges(G,x,y): O(|V|+|E|)
Add(G,x,y,l): O(|V|+|E|)
Delete(G,x,y,l): O(|V|+|E|)
From [ABFRV14]
Implementation of Graphs
[Sakr and Pardede 2012]
Incidence list:
V1: (destination,L2) (destination,L1)
V2: (source,L2) (source,L3)
V3: (destination,L3)
V4: (source,L1)
Edges:
L1: (V4,V1)
L2: (V2,V1)
L3: (V2,V3)
Properties:
Storage: O(|V|+|E|+|L|)
Adjacent(G,x,y): O(|E|)
Neighbors(G,x): O(|E|)
AdjacentEdges(G,x,y): O(|E|)
Add(G,x,y,l): O(|E|)
Delete(G,x,y,l): O(|E|)
From [ABFRV14]
Implementation of Graphs
[Sakr and Pardede 2012]
Adjacency matrix:
      V1     V2     V3     V4
V1
V2    {L2}          {L3}
V3
V4    {L1}
Properties:
Storage: O(|V|^2)
Adjacent(G,x,y): O(1)
Neighbors(G,x): O(|V|)
AdjacentEdges(G,x,y): O(|E|)
Add(G,x,y,l): O(|E|)
Delete(G,x,y,l): O(|E|)
From [ABFRV14]
Implementation of Graphs
[Sakr and Pardede 2012]
Incidence matrix:
      L1            L2            L3
V1    destination   destination
V2                  source        source
V3                                destination
V4    source
Properties:
Storage: O(|V|×|E|)
Adjacent(G,x,y): O(|E|)
Neighbors(G,x): O(|V|×|E|)
AdjacentEdges(G,x,y): O(|E|)
Add(G,x,y,l): O(|V|)
Delete(G,x,y,l): O(|V|)
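To compare the four representations side by side, the sketch below builds each of them in Python from the same edge list, namely the example graph used in these slides (L1 = (V4,V1), L2 = (V2,V1), L3 = (V2,V3)). The function names and the concrete Python data types are illustrative assumptions, not part of [ABFRV14] or [Sakr and Pardede 2012].

```python
# Example graph from the slides: L1 = (V4, V1), L2 = (V2, V1), L3 = (V2, V3)
NODES = ["V1", "V2", "V3", "V4"]
EDGES = {"L1": ("V4", "V1"), "L2": ("V2", "V1"), "L3": ("V2", "V3")}

def adjacency_list(nodes, edges):
    """node -> list of (target, {labels}) pairs."""
    grouped = {v: {} for v in nodes}
    for label, (src, dst) in edges.items():
        grouped[src].setdefault(dst, set()).add(label)
    return {v: list(targets.items()) for v, targets in grouped.items()}

def incidence_list(nodes, edges):
    """node -> list of (role, label) pairs; the edge table is kept separately."""
    inc = {v: [] for v in nodes}
    for label, (src, dst) in edges.items():
        inc[src].append(("source", label))
        inc[dst].append(("destination", label))
    return inc

def adjacency_matrix(nodes, edges):
    """|V| x |V| matrix whose cell [i][j] holds the labels of edges i -> j."""
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[set() for _ in nodes] for _ in nodes]
    for label, (src, dst) in edges.items():
        matrix[index[src]][index[dst]].add(label)
    return matrix

def incidence_matrix(nodes, edges):
    """|V| x |E| matrix whose cell [i][k] is 'source', 'destination' or None.

    Self-loops are not handled in this sketch (one cell cannot hold both roles).
    """
    index = {v: i for i, v in enumerate(nodes)}
    labels = sorted(edges)
    matrix = [[None] * len(labels) for _ in nodes]
    for k, label in enumerate(labels):
        src, dst = edges[label]
        matrix[index[src]][k] = "source"
        matrix[index[dst]][k] = "destination"
    return matrix
```

The matrices allocate a cell for every node pair (or node/edge pair) whether or not an edge exists, which is where the O(|V|^2) and O(|V|×|E|) storage bounds come from.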
From [ABFRV14]
Traversal Search
Breadth-first search: unexpanded nodes are stored in a queue.
Depth-first search: unexpanded nodes are stored in a stack.
From [ABFRV14]
Breadth First Search
[Figure: breadth-first traversal of an example graph with nodes numbered 0 to 8. Notation: node 1 is the starting node; first-level visited nodes are highlighted.]
*From Graph Database Management Systems. Course on Big Data, prof. Riccardo Torlone
(Univ. Roma Tre), available at www.dia.uniroma3.it/~torlone/bigdata/materiale.html
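A minimal Python sketch of the two traversal strategies, assuming the graph is given as a dictionary from each node to the list of its neighbours. The function name and the sample graph used to try it are hypothetical. The only difference between the two strategies is which end of the frontier is expanded next.

```python
from collections import deque

def traverse(graph, start, breadth_first=True):
    """Return the nodes reachable from start, in visiting order."""
    frontier = deque([start])   # unexpanded nodes
    seen = set()
    visited = []
    while frontier:
        # queue (FIFO) for breadth-first, stack (LIFO) for depth-first
        node = frontier.popleft() if breadth_first else frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        visited.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                frontier.append(nxt)
    return visited

# Hypothetical sample graph (not the one drawn in the slides)
sample = {1: [2, 3], 2: [4], 3: [4], 4: []}
bfs_order = traverse(sample, 1, breadth_first=True)
dfs_order = traverse(sample, 1, breadth_first=False)
```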
Native graph storage and processing
Index-free adjacency
• A database engine that uses index-free adjacency is one
in which each node maintains direct references to its
adjacent nodes; each node therefore acts as a micro-index
of other nearby nodes, which is much cheaper than using
global indexes.
Non-native graph storage
Property-graph databases
• A property graph is a labeled directed multigraph G = (V, E) where every
node v ∈ V and every edge e ∈ E can be associated with a set of <key,
value> pairs, called properties.
• Each edge represents a relationship between nodes and is associated with a
label, which is the name of the relationship.
Property-graph databases
[Figure: property graph with two nodes, one with properties name: Bob, age: 22 and one with properties name: Alice, age: 18]
People and their friends example
[Figure: property graph of people and their friends; each vertex has a name property, and one vertex has name: Alberto Pepe]
People and their friends example
1. Query the vertex.name index to find all the vertices with the name “Alberto
Pepe” [O(log_2 n)] (where n is the number of nodes with the name property)
2. Given the vertex returned, get the k friend edges emanating from this vertex
[O(k + x)] (where k is the number of friends and x is the number of the other
outgoing edges)
3. Given the k friend edges retrieved, get the k vertices on the heads of those
edges [O(k)]
4. Given these k vertices, get the k name properties of these vertices [O(ky)]
(where y is the number of properties in each vertex)
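The four steps above can be sketched in Python on a toy property graph. Everything except the name “Alberto Pepe” is invented sample data, and the dictionary-based name index is a simplification: it is a hash lookup rather than the tree-shaped index assumed by the O(log_2 n) estimate.

```python
# Toy property graph. All data except the name "Alberto Pepe" is invented.
vertices = {
    1: {"name": "Alberto Pepe"},
    2: {"name": "Ann"},
    3: {"name": "Carl"},
    4: {"name": "Some Paper"},
}
# Outgoing edges per vertex: (label, head vertex id)
out_edges = {
    1: [("friend", 2), ("friend", 3), ("wrote", 4)],
    2: [("friend", 1)],
}
# Index over the name property (assumption: a plain dict, i.e. a hash index)
name_index = {}
for vid, props in vertices.items():
    name_index.setdefault(props["name"], []).append(vid)

def friend_names(name):
    """Names of the friends of every vertex whose name property equals `name`."""
    result = []
    for vid in name_index.get(name, []):                   # step 1: index lookup
        friend_edges = [edge for edge in out_edges.get(vid, [])
                        if edge[0] == "friend"]            # step 2: scan k + x edges
        heads = [head for _, head in friend_edges]         # step 3: follow k edges
        result.extend(vertices[h]["name"] for h in heads)  # step 4: read properties
    return result
```

Only step 1 touches a global index; steps 2 to 4 work on data local to the starting vertex, so their cost does not depend on the total size of the graph.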
Hyper-graph databases
• A relationship (called a hyper-edge) can connect any number of nodes, thus
can be useful where the domain consists mainly of many-to-many
relationships
• In the example below we can represent with a single hyper-edge that Alice
and Bob jointly own a Mini, a Range Rover and a Prius. However, we
lose some flexibility in specifying some properties (e.g., who is the primary
owner)
• Notice that any hypergraph database can be encoded into a graph database
Triple stores
• Triple stores come from the Semantic Web movement,
where researchers are interested in large-scale knowledge
inference by adding semantic markup to the links that
connect web resources.
• A triple is a subject-predicate-object data structure. Using
triples, we can capture facts, such as “Ginger dances with
Fred” and “Fred likes ice cream.”
• The standard way to represent triples and query them is by
means of RDF and SPARQL, respectively.
Note: structuring information in triples does not per se realize the idea of the Semantic
Web, and by itself it does not enable knowledge inference. Nonetheless, triples
turned out to be a particularly useful format for exchanging information on the Web, and
triple stores have nowadays become very popular, not only in the Semantic Web context.
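As a small illustration of the subject-predicate-object structure, the two facts quoted above can be held as Python tuples and retrieved by pattern. The predicate spellings (dancesWith, likes) and the match function are assumptions made for this sketch; a real triple store would use URIs and indexes.

```python
# The two facts from the slide, stored as (subject, predicate, object) tuples.
triples = {
    ("Ginger", "dancesWith", "Fred"),
    ("Fred", "likes", "ice cream"),
}

def match(store, s=None, p=None, o=None):
    """Return the triples agreeing with every component that is not None."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
```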
Resource Description Framework
• RDF is a data model
– the model is domain-neutral, application-neutral and ready for
internationalization
– besides viewing it as a graph data model, it can also be viewed as
an object-oriented model (object/attribute/value)
• A standard XML syntax exists, which allows one to
specify RDF databases
– the RDF data model is an abstract, conceptual layer independent of
XML
– consequently, XML is a transfer syntax for RDF, not a
component of RDF
– in principle, RDF data might also occur in a form different from
XML
XML
• XML: eXtensible Mark-up Language
• XML documents are written through a user-
defined set of tags
• tags can be used to express additional properties of
the various pieces of information (i.e., enriching it
with its “meaning”)
XML: example
<course date="2014">
  <title>Big Data Management</title>
  <teacher>
    <name>Domenico Lembo</name>
    <email>lembo@dis.uniroma1.it</email>
  </teacher>
  <prereq>none</prereq>
</course>
XML
• XML: document = labeled tree
• node = label + attributes/values + contents
<course date="...">
  <title>...</title>
  <teacher>
    <name>...</name>
    <email>...</email>
  </teacher>
  <prereq>...</prereq>
</course>
corresponds to the tree:
course
  title
  teacher
    name
    email
  prereq
RDF model
resource --property--> value
RDF triples
example: “the document at
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/
has Ora Lassila as creator”
triple:
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ creator “Ora Lassila”
graph:
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ --creator--> “Ora Lassila”
⇒ RDF model = graph
RDF graph: example
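http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ --dc:publisher--> “W3C”
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ --dc:creator--> “Ora Lassila”
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ --dc:date--> “1999-02-22”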
Node and edge labels in RDF graphs
node and edge labels:
• URI - Uniform Resource Identifier
• Literal, string that denotes a fixed resource (i.e., a value)
• blank node, i.e., an anonymous label, representing
unnamed resources
but:
• a literal can only appear in object position (that is, literals
are end nodes in an RDF graph)
• a blank node can only appear in subject or object position
Example: Marco knows someone who was born on Epiphany
Marco --foaf:knows--> [blank node] --foaf:birthDate--> “01-06”
Blank nodes: unidentifiable resources
Example: The name of the creator of the specification of the RDF
syntax is Ora Lassila and his email address is
ora.lassila@nokia.com
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ --dc:creator--> [blank node]
[blank node] --myns:Name--> “Ora Lassila”
[blank node] --myns:EMail--> “ora.lassila@nokia.com”
RDF vocabulary
• RDF assigns a specific meaning to certain terms, the terms
defined by the URI prefix
http://www.w3.org/1999/02/22-rdf-syntax-ns#
(usually abbreviated as rdf:)
• Some examples (meaning explained in the following slides)
• rdf:type
• rdf:Seq, rdf:Bag, rdf:Alt
• rdf:_1, rdf:_2,..., rdf:li
• rdf:subject
• rdf:predicate
• rdf:object
• rdf:Statement
• rdf:Property
Containers
• Containers are collections
– they allow grouping of resources (or literal values)
• Different types of containers exist
– bag - unordered collection (rdf:Bag)
– seq - ordered collection (= “sequence”) (rdf:Seq)
– alt - represents alternatives (rdf:Alt)
• It is possible to express statements regarding the container (taken as a
whole) or on its members
rdf:_n – n-th member of a sequence
rdf:li – element of a collection
• Duplicate values are permitted (no mechanism to enforce unique value
constraints)
Containers
Example: The names of the creators of the RDF syntax
specification are (in order) Ora Lassila and Ralph Swick
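[Figure: http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ has a dc:creator edge to a blank node (the container). The container has an rdf:type edge to rdf:Seq and edges rdf:_1 to “Ora Lassila” and rdf:_2 to “Ralph Swick”. Annotations: rdf:type means that the subject is an instance of the class occurring as object; rdf:_1 means that the object is the first element of the container occurring as subject; rdf:Seq is the class of ordered containers.]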
http://www.w3.org/TR/REC-rdf-syntax/ dc:creator _:X.
_:X rdf:type rdf:Seq.
_:X rdf:_1 “Ora Lassila”.
_:X rdf:_2 “Ralph Swick”.
Higher-order statements
• One can make RDF statements about other RDF
statements
– example: “Ralph believes that the web contains one
billion documents”
• Higher-order statements
– allow us to express beliefs (and other modalities)
– are important for trust models, digital signatures, etc.
– also: metadata about metadata
– are represented by modeling RDF in RDF itself
⇒ basic tool: reification, i.e., representation of an
RDF assertion as a resource
Reification
• RDF provides a built-in predicate vocabulary for
reification:
• rdf:subject
• rdf:predicate
• rdf:object
• rdf:Statement
• Using this vocabulary (i.e., these URIs from the rdf:
namespace) it is possible to represent a triple through a
blank node
Reification: example
• the statement “The technical report on RDF was written by Ora
Lassila” can be represented by the following four triples:
_:x rdf:predicate dc:creator.
_:x rdf:subject http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/.
_:x rdf:object “Ora Lassila”.
_:x rdf:type rdf:Statement.
[Figure: a reified statement node with rdf:subject, rdf:predicate (dc:creator), rdf:object and rdf:type (rdf:Statement) edges; the figure also shows the literal “Library of Congress”]
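The construction of the four reification triples is mechanical, as the following Python sketch suggests. The function name reify and the default blank-node label are illustrative assumptions; the property and class URIs are those of the rdf: namespace.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(triple, node="_:x"):
    """Return the four triples that describe `triple` as a resource."""
    subject, predicate, obj = triple
    return [
        (node, RDF + "type", RDF + "Statement"),
        (node, RDF + "subject", subject),
        (node, RDF + "predicate", predicate),
        (node, RDF + "object", obj),
    ]

statement = (
    "http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/",
    "http://purl.org/dc/elements/1.1/creator",
    "Ora Lassila",
)
reified = reify(statement)
```

The blank node can then be used as subject or object of further triples, which is how statements about statements are expressed.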
Exercise
Draw the RDF graph that represents the following assertions:
• Document 1 was created by Paul
• Document 2 and Document 3 were created by the same
author (which is unknown)
• Document 3 says that Document 1 was published by the
W3C
Use the predicates dc:creator and dc:publisher, and
assume that the three documents are identified by the URIs
doc1, doc2, and doc3, respectively.
Solution
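[Figure: RDF graph of the solution. doc1 has dc:creator “Paul”; doc2 and doc3 have dc:creator edges to the same blank node; doc3 has a myns:says edge to a reified statement whose rdf:subject is doc1.]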
RDF syntaxes
• N3 notation:
subject predicate object.
• Turtle (Terse RDF Triple Language) notation. Example:
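@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix : <http://www.ex.org/>.  # default namespace assumed for this example
:mary rdf:type <http://www.ex.org/Gardener>.
:mary :worksFor :ElJardinHaus.
:mary :name "Dalileh Jones"@en.
_:john :worksFor :ElJardinHaus.
_:john :idNumber "54321"^^xsd:integer.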
Turtle Notation: Example*
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .
<http://www.w3.org/TR/rdf-syntax-grammar>
dc:title "RDF/XML Syntax Specification (Revised)" ;
ex:editor [
ex:fullname "Dave Beckett";
ex:homePage <http://purl.org/net/dajobe/>
] .
The example encodes an RDF database that expresses the following facts:
- The W3C technical report on RDF syntax and grammar, has the title RDF/XML
Syntax Specification (Revised).
- That report's editor is a certain individual, who in turn
- Has full name Dave Beckett.
- Has a home page at http://purl.org/net/dajobe/.
*Taken from http://en.wikipedia.org/wiki/Turtle_(syntax)
Turtle Notation: Example*
Here are the four triples of the RDF graph made explicit in N-Triples notation:
<http://www.w3.org/TR/rdf-syntax-grammar> <http://purl.org/dc/elements/1.1/title> "RDF/XML Syntax Specification (Revised)" .
<http://www.w3.org/TR/rdf-syntax-grammar> <http://example.org/stuff/1.0/editor> _:bnode .
_:bnode <http://example.org/stuff/1.0/fullname> "Dave Beckett" .
_:bnode <http://example.org/stuff/1.0/homePage> <http://purl.org/net/dajobe/> .
*Taken from http://en.wikipedia.org/wiki/Turtle_(syntax)
Turtle Notation: Example*
@base <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://www.perceive.net/schemas/relationship/> .
<#green-goblin>
rel:enemyOf <#spiderman> ;
a foaf:Person ; # in the context of the Marvel universe
foaf:name "Green Goblin" .
<#spiderman>
rel:enemyOf <#green-goblin> ;
a foaf:Person ;
foaf:name "Spiderman", "Человек-паук"@ru .
*Taken from http://www.w3.org/TR/turtle/
RDF/XML syntax
• A node in the RDF graph that represents a resource (labeled or not) is
represented by an element rdf:Description, while its label, if any, is
defined as the value of the rdf:about property
• An edge outgoing from a node N is represented as a sub-element of the
element that represents N. The type of this sub-element is the label of
the edge.
• The end node of an edge is represented as the content of the element
representing the edge. It is either
– a value (if the end node contains a literal)
– or a new resource (if the end node contains a URI): in this case it is
represented by a sub-element of type rdf:Description
• Values (literals) can be assigned a type (the same types defined in XML
Schema)
RDF/XML syntax: Example
http://www.w3.org/TR/REC-rdf-syntax/ --dc:creator--> “Ora Lassila”
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/TR/REC-rdf-syntax/">
    <dc:creator>Ora Lassila</dc:creator>
  </rdf:Description>
</rdf:RDF>
RDF/XML syntax: Example
<rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar">
<ex:editor>
<rdf:Description>
<ex:homePage>
<rdf:Description rdf:about="http://purl.org/net/dajobe/">
</rdf:Description>
</ex:homePage>
…….
</rdf:Description>
</ex:editor>
……
</rdf:Description>
RDF/XML syntax: simplifications
• A resource that is a literal and that is the object of a predicate
may be encoded as the value of an attribute of the element that
represents the subject. The name of this attribute is the label of
the predicate
RDF/XML simplified syntax: Example
http://www.w3.org/TR/REC-rdf-syntax/ --dc:creator--> “Ora Lassila”
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/TR/REC-rdf-syntax/"
                   dc:creator="Ora Lassila"/>
</rdf:RDF>
Exercise 2: RDF/XML syntax
[Figure: the RDF graph to be encoded in RDF/XML; it includes myns:doc3 with dc:creator and myns:dice edges and a reified statement with an rdf:subject edge]
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns="http://www.dis.uniroma1.it/~poggi/esempi_rdf/">
  <rdf:Description rdf:about="doc2">
    <dc:creator rdf:nodeID="C"/>
  </rdf:Description>
  <rdf:Description rdf:about="doc3">
    <dc:creator rdf:nodeID="C"/>
    <dice>
      <rdf:Description rdf:object="W3C">
        <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/publisher"/>
        <rdf:subject>
          <rdf:Description rdf:about="doc1"
                           dc:creator="Paul"/>
        </rdf:subject>
      </rdf:Description>
    </dice>
  </rdf:Description>
</rdf:RDF>
RDF Schema
RDFS - example
Legend:
– RDF instance
– RDFS schema
– predefined in RDFS
– logically implied by the RDFS semantics
[Figure: RDFS graph relating rdfs:Resource, rdfs:Class, rdf:Property, Person, hasSupervisor, Frank and Jeen through rdf:type and rdfs:subClassOf edges]
RDFS – example: triples
Student rdfs:subClassOf Person.
Researcher rdfs:subClassOf Person.
hasSupervisor rdfs:range Researcher.
hasSupervisor rdfs:domain Student.
Frank rdf:type Student.
Jeen rdf:type Researcher.
Frank hasSupervisor Jeen.
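To see which triples are logically implied by the seven triples above, the following Python sketch applies a handful of RDFS entailment rules (domain, range, subclass membership, subclass transitivity) until a fixed point is reached. It is a deliberately partial sketch: the full RDFS semantics includes further rules (e.g. for rdfs:subPropertyOf and for the typing of classes and properties) that are omitted here, and the function name is an assumption.

```python
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"
DOMAIN, RANGE = "rdfs:domain", "rdfs:range"

graph = {
    ("Student", SUBCLASS, "Person"),
    ("Researcher", SUBCLASS, "Person"),
    ("hasSupervisor", RANGE, "Researcher"),
    ("hasSupervisor", DOMAIN, "Student"),
    ("Frank", TYPE, "Student"),
    ("Jeen", TYPE, "Researcher"),
    ("Frank", "hasSupervisor", "Jeen"),
}

def rdfs_closure(triples):
    """Apply a few RDFS rules until no new triple is produced."""
    closed = set(triples)
    while True:
        new = set()
        for s, p, o in closed:
            if p == SUBCLASS:
                # instances of a subclass are instances of the superclass
                new |= {(x, TYPE, o) for x, q, c in closed if q == TYPE and c == s}
                # rdfs:subClassOf is transitive
                new |= {(s, SUBCLASS, d) for c, q, d in closed
                        if q == SUBCLASS and c == o}
            elif p == DOMAIN:
                # subjects of property s are instances of its domain
                new |= {(x, TYPE, o) for x, q, _ in closed if q == s}
            elif p == RANGE:
                # objects of property s are instances of its range
                new |= {(y, TYPE, o) for _, q, y in closed if q == s}
        if new <= closed:
            return closed
        closed |= new

inferred = rdfs_closure(graph) - graph
```

On this graph the domain and range rules only re-derive triples that are already stated, so the new triples come from the subclass rule alone.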
RDFS – example: XML syntax
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="#Student">
    <rdfs:subClassOf rdf:resource="#Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="#Researcher">
    <rdfs:subClassOf rdf:resource="#Person"/>
  </rdf:Description>
  <rdf:Description rdf:about="#hasSupervisor">
    <rdfs:domain rdf:resource="#Student"/>
    <rdfs:range rdf:resource="#Researcher"/>
  </rdf:Description>
  <rdf:Description rdf:about="#Frank">
    <rdf:type rdf:resource="#Student"/>
    <hasSupervisor rdf:resource="#Jeen"/>
  </rdf:Description>
  <rdf:Description rdf:about="#Jeen">
    <rdf:type rdf:resource="#Researcher"/>
  </rdf:Description>
</rdf:RDF>
RDFS
example (classes):
(ex:MotorVehicle, rdf:type, rdfs:Class)
(ex:PassengerVehicle, rdf:type, rdfs:Class)
(ex:Van, rdf:type, rdfs:Class)
(ex:Truck, rdf:type, rdfs:Class)
(ex:MiniVan, rdf:type, rdfs:Class)
(ex:PassengerVehicle, rdfs:subClassOf,
ex:MotorVehicle)
(ex:Van, rdfs:subClassOf, ex:MotorVehicle)
(ex:Truck, rdfs:subClassOf, ex:MotorVehicle)
(ex:MiniVan, rdfs:subClassOf, ex:Van)
(ex:MiniVan, rdfs:subClassOf, ex:PassengerVehicle)
RDFS
example (properties):
(ex:weight, rdf:type, rdf:Property)
(ex:weight, rdfs:domain, ex:MotorVehicle)
(ex:weight, rdfs:range, Integer)
RDFS: meta-modeling abilities
example (meta-classes):
(ex:MotorVehicle, rdf:type, rdfs:Class)
(ex:myClasses, rdf:type, rdfs:Class)
(ex:MotorVehicle, rdf:type, ex:myClasses)
RDFS: XML syntax
example:
<rdf:Description rdf:about="#MotorVehicle">
  <rdf:type rdf:resource="http://www.w3.org/...#Class"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/...#Resource"/>
</rdf:Description>
<rdf:Description rdf:about="#Truck">
  <rdf:type rdf:resource="http://www.w3.org/...#Class"/>
  <rdfs:subClassOf rdf:resource="#MotorVehicle"/>
</rdf:Description>
RDFS: XML syntax
example (cont.):
<rdf:Description rdf:about="#registeredTo">
  <rdf:type rdf:resource="http://www.w3.org/...#Property"/>
  <rdfs:domain rdf:resource="#MotorVehicle"/>
  <rdfs:range rdf:resource="#Person"/>
</rdf:Description>
<rdf:Description rdf:about="#ownedBy">
  <rdf:type rdf:resource="http://www.w3.org/...#Property"/>
  <rdfs:subPropertyOf rdf:resource="#registeredTo"/>
</rdf:Description>
RDF + RDFS: semantics
• what is the exact meaning of an RDF(S) graph?
• initially, a formal semantics was not defined!
• main problems:
• bnodes
• meta-modeling
• formal semantics for RDFS vocabulary
• recently, a model-theoretic semantics has been
provided
⇒formal definition of entailment and query
answering over RDF(S) graphs
Incomplete information in RDF graphs
• bnodes = existential values (null values)
⇒ introduce incomplete information in RDF graphs
• an RDF graph can be seen as an incomplete
database represented in the form of a naïve table,
i.e., a relational table containing constants and named
existential variables (also called labeled nulls)
• an RDF graph can thus be represented by a single
(naïve) table T, with values being constants or
named existential variables
RDF + RDFS: semantics
or
(#C rdf:type #C)
are correct (formally meaningful) RDF statements
⇒ but no intuitive semantics
Formal semantics for RDF + RDFS
Exercise 3: RDF/RDFS model
Exercise 3: solution (graph)
[Figure: solution graph over the nodes rdfs:Class, rdf:Property and URI1 to URI6; URI3 has rdfs:domain and rdfs:range edges and also labels an edge between instances]
Exercise 3: solution (XML syntax)
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="URI1">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Description>
  <rdf:Description rdf:about="URI2">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Description>
  <rdf:Description rdf:about="URI3">
    <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
    <rdfs:domain rdf:resource="URI1"/>
    <rdfs:range rdf:resource="URI2"/>
  </rdf:Description>
  <rdf:Description rdf:about="URI4">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
    <URI3 rdf:resource="URI6"/>
  </rdf:Description>
  <rdf:Description rdf:about="URI5">
    <rdf:type rdf:resource="URI2"/>
  </rdf:Description>
  <rdf:Description rdf:about="URI6">
    <rdf:type rdf:resource="URI2"/>
  </rdf:Description>
</rdf:RDF>
Querying RDF: SPARQL
SPARQL – query structure
• A SPARQL query includes, in the following order:
– prefix declarations, to abbreviate URIs (optional)
– SELECT clause, to specify the information to be returned
– dataset definitions (FROM), to specify the graph to be queried (there can be
more than one)
– WHERE clause, to specify the query pattern, i.e., the conditions that have to be
satisfied by the triples of the dataset
– additional modifiers, to reorganize the results of the query (optional)
# prefix declaration
PREFIX es: <...>
...
# data to be returned
SELECT ...
# dataset definition
FROM <...>
# graph pattern specification
WHERE { ...}
# modifiers
ORDER BY ...
SPARQL – the WHERE clause
SPARQL – example
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?author
FROM <http://thedatahub.org/dataset/bluk-bnb>
WHERE { ?x dct:creator ?y.
?x dct:title "Romeo and Juliet".
?y foaf:name ?author}
• Variables are marked with the "?" prefix ("$" is also possible).
• The ?author variable will be returned as the result.
• The FROM clause specifies the URI of the graph to be queried.
• The SPARQL query processor returns all hits matching the pattern formed by the
three triple patterns in the WHERE clause.
• "Property orientation": class matches can be conducted solely through
class-attributes/properties.
SPARQL – query evaluation
The query returns all resources R for which there are resources
X, Y, such that replacing the variables ?author, ?x and ?y with R, X and Y,
respectively, you get triples that belong to the queried graph.
[Figure: QUERY pattern (?x --dct:creator--> ?y, ?x --dct:title--> “Romeo and Juliet”, ?y --foaf:name--> ?author) next to the QUERIED GRAPH, which contains bnb:resource/013567865, bnb:person/AppignanesiRichard with foaf:name “Richard Appignanesi”, bnb:resource/015432907 and bnb:person/ShakespeareWilliam1564-1616, with dct:title “Romeo and Juliet”; the RESULT is highlighted.]
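The evaluation just described (find every substitution of the variables that turns all triple patterns into triples of the graph) can be sketched in Python as below. The sample graph is only loosely based on the figure: the exact edges, and the literal "William Shakespeare", are assumptions made for illustration. The helper names are hypothetical, and a real SPARQL engine uses indexes and join ordering instead of this nested scan.

```python
def is_var(term):
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if binding.setdefault(p_term, t_term) != t_term:
                return None          # variable already bound to another value
        elif p_term != t_term:
            return None              # constant does not match
    return binding

def evaluate(patterns, graph):
    """Return all variable bindings that satisfy every triple pattern."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in graph:
                candidate = match_pattern(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        solutions = extended
    return solutions

# Assumed sample data, loosely following the figure
graph = [
    ("bnb:resource/013567865", "dct:creator", "bnb:person/AppignanesiRichard"),
    ("bnb:resource/013567865", "dct:title", "Romeo and Juliet"),
    ("bnb:person/AppignanesiRichard", "foaf:name", "Richard Appignanesi"),
    ("bnb:resource/015432907", "dct:creator", "bnb:person/ShakespeareWilliam1564-1616"),
    ("bnb:resource/015432907", "dct:title", "Romeo and Juliet"),
    ("bnb:person/ShakespeareWilliam1564-1616", "foaf:name", "William Shakespeare"),
]
query = [
    ("?x", "dct:creator", "?y"),
    ("?x", "dct:title", "Romeo and Juliet"),
    ("?y", "foaf:name", "?author"),
]
authors = sorted(b["?author"] for b in evaluate(query, graph))
```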
SPARQL endpoints
• SPARQL queries are performed on RDF datasets (i.e., graphs)
• A SPARQL endpoint accepts queries and returns results via the HTTP
protocol
– Generic endpoints query all RDF datasets that are accessible
via the Web
• http://semantic.ckan.net/sparql,
http://lod.openlinksw.com/sparql
– Dedicated endpoints are intended to query one or more specific datasets
• http://bnb.data.bl.uk/sparql,
http://dbpedia.org/sparql ...
• The FROM clause, in principle, is mandatory, but
– when the endpoint is dedicated, typically, you can omit it in the
specification of queries over such endpoint
– when the endpoint is generic, there is often a default dataset that is
queried in the case in which the FROM clause is not specified
-> In our examples, we often omit the FROM clause, implicitly assuming we
are querying specific endpoints
SPARQL results
SPARQL query – example
RDF graph:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:a foaf:name "Johnny Lee Outlaw" .
_:a foaf:mbox <mailto:jlow@example.com> .
_:b foaf:name "Peter Goodguy" .
_:b foaf:mbox <mailto:peter@example.org> .
_:c foaf:mbox <mailto:carol@example.org> .
SPARQL – use of filters: example
RDF graph:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:a foaf:name "Johnny Lee Outlaw" .
_:a foaf:mbox <mailto:jlow@example.com> .
_:b foaf:name "Peter Goodguy" .
_:b foaf:mbox <mailto:peter@example.org> .
_:c foaf:mbox <mailto:carol@example.org> .
Predicates that can be used
in the FILTER clause
• Logical connectives:
! (NOT)
&& (AND)
|| (OR)
• Comparison: >, <, =, != (not equal), IN, NOT IN, ...
• Test: isURI, isBlank, isLiteral, isNumeric, ...
• ...
SPARQL – example of query on DBPedia
SPARQL – optional patterns: example 1
RDF graph:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:a foaf:name "Johnny Lee Outlaw" .
_:a foaf:mbox <mailto:jlow@example.com> .
_:b foaf:name "Peter Goodguy" .
_:b foaf:mbox <mailto:peter@example.org> .
_:c foaf:mbox <mailto:carol@example.org> .
SPARQL – optional patterns: example 2
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
_:a rdf:type foaf:Person .
_:a foaf:name "Alice" .
_:a foaf:mbox <mailto:alice@example.com> .
_:a foaf:mbox <mailto:alice@work.example> .
_:b rdf:type foaf:Person .
_:b foaf:name "Bob" .
“Alice” <mailto:alice@example.com>
“Alice” <mailto:alice@work.example>
“Bob”
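The result rows above (two rows for Alice, one row for Bob with no mailbox) are what a left-outer-join reading of OPTIONAL produces. The Python sketch below mimics a query of the form SELECT ?name ?mbox WHERE { ?x foaf:name ?name . OPTIONAL { ?x foaf:mbox ?mbox } }; that query text is an assumption, since the query itself does not appear in these slides, and the function name is hypothetical.

```python
graph = [
    ("_:a", "rdf:type", "foaf:Person"),
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:mbox", "mailto:alice@example.com"),
    ("_:a", "foaf:mbox", "mailto:alice@work.example"),
    ("_:b", "rdf:type", "foaf:Person"),
    ("_:b", "foaf:name", "Bob"),
]

def names_with_optional_mbox(triples):
    """Pair every name with each mailbox of the same subject, if any."""
    rows = []
    for s, p, name in triples:
        if p != "foaf:name":
            continue
        mboxes = [o for s2, p2, o in triples if s2 == s and p2 == "foaf:mbox"]
        if mboxes:
            rows.extend((name, mbox) for mbox in mboxes)
        else:
            rows.append((name, None))   # optional part unmatched: ?mbox unbound
    return rows

rows = names_with_optional_mbox(graph)
```

Without OPTIONAL, the row for Bob would be dropped, because the mailbox pattern would have to match for every solution.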
SPARQL – optional patterns: example 3
• Return all resources contained in the dataset of the British National
Bibliography, whose title is "Romeo and Juliet", along with the 10-digit
ISBN and the 13-digit ISBN, if they have them
prefix dct:<http://purl.org/dc/terms/>
prefix bibo:<http://purl.org/ontology/bibo/>
select ?x ?i10 ?i13
from <http://thedatahub.org/dataset/bluk-bnb>
WHERE {?x dct:title "Romeo and Juliet".
OPTIONAL {?x bibo:isbn10 ?i10}.
OPTIONAL {?x bibo:isbn13 ?i13}}
SPARQL – UNIONs of graph patterns
A graph pattern can be defined as the union of two (or more) graph
patterns
Example: Return all the resources stored in the dataset of the British
National Bibliography whose title is "Romeo and Juliet" and that have
either a 10-digit ISBN or a 13-digit ISBN
prefix dct:<http://purl.org/dc/terms/>
prefix bibo:<http://purl.org/ontology/bibo/>
select ?x ?i
from <http://thedatahub.org/dataset/bluk-bnb>
WHERE {{?x dct:title "Romeo and Juliet".
?x bibo:isbn10 ?i} UNION
{?x dct:title "Romeo and Juliet".
?x bibo:isbn13 ?i}}
SPARQL – “Querying predicates”
select distinct ?p
from <http://thedatahub.org/dataset/bluk-bnb>
where {<http://bnb.data.bl.uk/id/resource/015432907> ?p ?v}
Exercise 4: SPARQL queries
Exercise 4: solution
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?x
WHERE { ?x dc:creator ?y .
?x dc:date ?z . }
Exercise 5: SPARQL queries
Exercise 5: solution
PREFIX myns: <http://www.dis.uniroma1.it/~poggi/esempi_rdf/>
SELECT ?x
WHERE { myns:uri1 ?x ?y .
myns:uri2 ?x ?z . }
Exercise 6: SPARQL queries
Exercise 6: solution
PREFIX myns: <http://www.dis.uniroma1.it/~poggi/esempi_rdf/>
SELECT ?x
WHERE { { myns:uri1 ?x ?y } UNION
{ myns:uri2 ?x ?z } }
Exercise 7: SPARQL queries
Exercise 7: solution
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?z
WHERE { ?x dc:creator ?y .
?y foaf:name ?z .
?x dc:date ?w }
Exercise 8: SPARQL queries
Exercise 8: solution
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?z ?w
WHERE { ?x dc:creator ?y .
?y foaf:name ?z .
OPTIONAL { ?x dc:date ?w } }
SPARQL 1.1: property paths
Exercise 9: property paths
Exercise 9: solution
PREFIX ex: <http://example.org/example/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?z
WHERE { { ex:John (ex:hasFather | ex:hasMother)+ ?x .
          ?x foaf:name ?z . }
        UNION
        { ?x (ex:hasFather | ex:hasMother)+ ex:John .
          ?x foaf:name ?z . } }
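The "+" operator in the solution above stands for "one or more steps" along the alternative predicates. A minimal Python sketch of that semantics follows, assuming triples are plain tuples; the family data and the helper names are invented for illustration and are not part of the exercise.

```python
def one_or_more(graph, start, predicates):
    """Nodes reachable from `start` via one or more edges labeled in `predicates`."""
    reached = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in graph:
            if s == node and p in predicates and o not in reached:
                reached.add(o)
                frontier.append(o)
    return reached

def inverse(graph):
    """Swap subject and object, to follow edges backwards."""
    return [(o, p, s) for s, p, o in graph]

# Invented sample data
family = [
    ("ex:John", "ex:hasFather", "ex:Paul"),
    ("ex:John", "ex:hasMother", "ex:Mary"),
    ("ex:Paul", "ex:hasFather", "ex:George"),
    ("ex:Ann", "ex:hasFather", "ex:John"),
]
PARENT = {"ex:hasFather", "ex:hasMother"}
ancestors = one_or_more(family, "ex:John", PARENT)
descendants = one_or_more(inverse(family), "ex:John", PARENT)
```

The two branches of the UNION correspond to the two calls: the first follows the parent edges forwards from ex:John, the second follows them backwards.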
Exercise 10: property paths
Exercise 10: solution
PREFIX ex: <http://example.org/example/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x
WHERE { ex:John rdf:type ?c .
        ?c rdfs:subClassOf* ?x . }
Or, equivalently:
RDF Storage*
• RDF data management has been studied in a variety of contexts.
This variety is actually reflected in a richness of the perspectives
and approaches to storage and indexing of RDF datasets, typically
driven by particular classes of query patterns and inspired
by techniques developed in various research communities.
• In the literature, we can identify three main basic perspectives
underlying this variety.
• The relational perspective.
• The entity perspective.
• The pure graph-based perspective.
* From: Storing and Indexing Massive RDF Data Sets. Yongming Luo, Francois
Picalausa, George H.L. Fletcher, Jan Hidders, and Stijn Vansummeren. In Semantic
Search over the Web. Springer. 2012
The relational perspective
• An RDF graph is seen just as a particular type of relational data,
and techniques developed for storing, indexing and answering
queries on relational data can hence be reused and specialized for
storing and indexing RDF graphs.
The horizontal representation - example
[Figure: RDF triples and their relational horizontal representation]
The horizontal representation
The horizontal representation –
property tables
• To minimize the storage overhead caused by empty cells,
the so-called property-table approach divides
the wide table into multiple smaller tables, each containing
related predicates
• For example, in the music fan RDF graph, a different table
could be introduced for Works, Fans, and Artists. In this
scenario, the Works table would have columns for
Composer, FileType, MediaType, and Title, but would not
contain the unrelated phone or friendOf columns.
• How to divide the wide table into property tables is up to the
designers (support for this is provided by some RDF tools)
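The saving in empty cells can be illustrated with a small Python sketch. It assumes a toy wide table whose subjects and values are invented; only the column names (Composer, FileType, MediaType, Title, phone, friendOf) and the Works/Fans grouping come from the example above, and the function name is hypothetical. It is not the layout used by any particular RDF tool.

```python
# Wide (horizontal) table: one row per subject, one column per predicate.
# Subjects and values are made up; column names follow the example in the text.
wide_table = {
    "work1": {"Composer": "Debussy", "Title": "La Mer", "MediaType": "audio"},
    "fan1": {"phone": "555-0100", "friendOf": "fan2"},
    "fan2": {"phone": "555-0101"},
}

# Designer-chosen grouping of related predicates into property tables.
property_tables = {
    "Works": ["Composer", "FileType", "MediaType", "Title"],
    "Fans": ["phone", "friendOf"],
}


def split_into_property_tables(wide, layout):
    """Build one narrow table per group; a subject is stored only in the
    tables for which it has at least one value."""
    result = {name: {} for name in layout}
    for subject, row in wide.items():
        for name, columns in layout.items():
            if any(c in row for c in columns):
                result[name][subject] = {c: row.get(c) for c in columns}
    return result


all_columns = [c for cols in property_tables.values() for c in cols]
# Empty cells in the single wide table (every subject has every column).
empty_wide = sum(
    1 for row in wide_table.values() for c in all_columns if c not in row
)

tables = split_into_property_tables(wide_table, property_tables)
# Empty cells that remain after splitting.
empty_split = sum(
    1 for t in tables.values() for row in t.values() for v in row.values() if v is None
)
print(empty_wide, empty_split)
```

Some empty cells remain (for instance a work with no FileType), which is why the choice of grouping matters.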
The horizontal representation –
vertical partitioning
• The so-called vertically partitioned database approach (not
to be confused with the vertical representation approach)
takes the decomposition of the horizontal representation to
its extreme:
each predicate column p of the horizontal table is
materialized as a binary table over the schema (subject,
p). Each row of each binary table essentially corresponds
to a triple.
• Note that this solves both the empty-cell issue and the
multiple-object issue at the same time.
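Vertical partitioning can be sketched in a few lines of Python. The triples and the function name below are hypothetical; the point is only that each predicate gets its own two-column table, that a subject with several objects for the same predicate simply contributes several rows, and that no empty cells arise.

```python
from collections import defaultdict

# Toy triples; subjects, predicates and objects are made up for illustration.
triples = [
    ("fan1", "friendOf", "fan2"),
    ("fan1", "friendOf", "fan3"),   # second object for the same predicate
    ("fan1", "phone", "555-0100"),
    ("work1", "Title", "La Mer"),
]


def vertically_partition(graph):
    """Return one binary table per predicate: predicate -> list of (subject, object)."""
    tables = defaultdict(list)
    for s, p, o in graph:
        tables[p].append((s, o))
    return dict(tables)


tables = vertically_partition(triples)
for predicate, rows in tables.items():
    print(predicate, rows)
```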
The relational perspective –
storage of URIs and literals
• Independently of the approach followed, relational
storage of RDF graphs commonly adopts the following policy on
how to store values in tables: rather than storing each URI or
literal value directly as a string, implementations usually
associate a unique numerical identifier with each resource and store
this identifier instead. Indeed,
• since there is no a priori bound on the length of the URIs or literal
values that can occur in RDF graphs, it is necessary to support
variable-length records when storing resources directly as strings
• RDF graphs typically contain very long URI strings and literal
values that, in addition, are frequently repeated in the same RDF
graph.
• Unique identifiers can be computed in two general ways: (i) applying a
hash function to the resource string; (ii) maintaining a counter that is
incremented whenever a new resource is added. In both cases,
dictionary tables are used to translate encoded values into URIs and
literals
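The counter-based variant (ii) can be sketched as follows. This is a minimal Python illustration with a hypothetical Dictionary class and made-up URIs, not the encoding of any specific triple store. Each distinct string is stored once in the dictionary, and the triple table holds only small integers.

```python
class Dictionary:
    """Counter-based dictionary: each new resource string gets the next integer."""

    def __init__(self):
        self.id_of = {}       # string -> identifier
        self.string_of = []   # identifier -> string

    def encode(self, resource):
        """Return the identifier of resource, assigning a fresh one if needed."""
        if resource not in self.id_of:
            self.id_of[resource] = len(self.string_of)
            self.string_of.append(resource)
        return self.id_of[resource]

    def decode(self, identifier):
        """Translate an identifier back into the original URI or literal."""
        return self.string_of[identifier]


# Hypothetical triples with long, repeated URIs.
triples = [
    ("http://example.org/fan1", "http://xmlns.com/foaf/0.1/knows", "http://example.org/fan2"),
    ("http://example.org/fan2", "http://xmlns.com/foaf/0.1/knows", "http://example.org/fan1"),
]

dictionary = Dictionary()
encoded = [tuple(dictionary.encode(term) for term in t) for t in triples]
decoded = [tuple(dictionary.decode(i) for i in row) for row in encoded]
print(encoded)
```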
The entity perspective for
storing RDF graphs
The second basic perspective, originating from the information
retrieval community, is the entity perspective:
• Resources in the RDF graph are interpreted as “objects”, or
“entities”
• each entity is determined by a set of attribute-value pairs
• In particular, a resource r in RDF graph G is viewed as an
entity with the following set of (attribute,value) pairs:
The entity perspective - example
[Figure: RDF triples and the corresponding entity view]
The entity perspective
• Techniques from the information retrieval literature can then
be specialized to support query patterns that retrieve
entities based on particular attributes and/or values
• For example, in the previous representation,
user8604 is retrieved when searching for entities born in
1975 (i.e., entities that have 1975 as a value of the attribute birthdate) as
well as when searching for entities with friends who like
Impressionist music. Note that entity user3789 is not
retrieved by either of these queries.
• Specific tools provide their own solutions to these problems
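A rough Python sketch of an entity view and of attribute/value lookup is given below. It rests on an assumption: an entity is taken to collect one (p, o) pair for each outgoing triple and one inverse pair (written p^-1 here) for each incoming triple, which is one common reading and not necessarily the exact definition intended on the slide. The users, attributes and function names are hypothetical, and in this simplified version literals also end up with an entry.

```python
from collections import defaultdict

# Toy triples; identifiers and attribute names are invented for illustration.
triples = [
    ("userA", "birthdate", "1975"),
    ("userA", "friendOf", "userB"),
    ("userB", "birthdate", "1980"),
]


def entity_view(graph):
    """Map each resource to its set of (attribute, value) pairs.

    Assumption: outgoing triples give (p, o); incoming triples give (p^-1, s).
    """
    view = defaultdict(set)
    for s, p, o in graph:
        view[s].add((p, o))
        view[o].add((p + "^-1", s))
    return view


def find(view, attribute=None, value=None):
    """Entities with at least one pair matching the given attribute and/or value."""
    return sorted(
        entity
        for entity, pairs in view.items()
        if any(
            (attribute is None or a == attribute) and (value is None or v == value)
            for a, v in pairs
        )
    )


view = entity_view(triples)
born_1975 = find(view, attribute="birthdate", value="1975")
has_friend = find(view, attribute="friendOf")
print(born_1975, has_friend)
```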
The graph-based perspective for
storing RDF graphs
• Under this graph-based perspective, the focus is on
supporting navigation in the RDF graph when viewed as a
classical graph in which subjects and objects form the
nodes, and triples specify directed, labeled edges. The aim is
therefore to store RDF datasets natively as graphs.
• Typical query patterns supported in this perspective are
graph-theoretic queries such as reachability between nodes.
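A reachability query over a natively stored graph can be sketched with adjacency lists and a breadth-first search. The Python code below is only an illustration under stated assumptions: the nodes, edge labels and function names are hypothetical, and real graph stores use far more elaborate structures. Each node keeps its outgoing labelled edges, so navigation does not require scanning a triple table.

```python
from collections import defaultdict, deque

# Toy triples; node and predicate names are made up.
triples = [
    ("a", "knows", "b"),
    ("b", "likes", "c"),
    ("d", "knows", "a"),
]


def build_adjacency(graph):
    """Store the graph natively: node -> list of (edge label, target node)."""
    adjacency = defaultdict(list)
    for s, p, o in graph:
        adjacency[s].append((p, o))
    return adjacency


def find_path(adjacency, source, target):
    """Breadth-first search along directed edges.

    Returns the list of (label, node) steps from source to target,
    or None if target is not reachable.
    """
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            steps = []
            while parent[node] is not None:
                previous, label = parent[node]
                steps.append((label, node))
                node = previous
            return steps[::-1]
        for label, nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, label)
                queue.append(nxt)
    return None


adjacency = build_adjacency(triples)
print(find_path(adjacency, "a", "c"))
print(find_path(adjacency, "c", "a"))
```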
The graph-based perspective for
storing RDF graphs
Graph Databases
RDF in the real world
• RDF and SPARQL are W3C standards
• Widespread use for metadata representation, e.g.
• Meta Content Framework (MCF) developed by Apple as a
specification of a content format for structuring metadata
about web sites and other data (it is specified in a language
which is a sort of ancestor of RDF)
• Adobe XMP (Extensible Metadata Platform), an RDF-based
schema whose properties provide basic descriptive
information on files
• Oracle supports RDF, and provides an extension of SQL to
query RDF data
• HP has a big lab (in Bristol) developing specialized data
stores for RDF (it also initiated the development of the
Jena framework for RDF graph management, which it carried out
until October 2009; development then moved to the Apache Software Foundation)
RDF in the real world
• Current main application of RDF: linked data
• linked data = using the Web to create typed links
between data from different sources
• i.e.: create a Web of data
• DBpedia, Geonames, US Census, EuroStat,
MusicBrainz, BBC Programmes, Flickr, DBLP,
PubMed, UniProt, FOAF, SIOC, OpenCyc,
UMBEL, Virtual Observatories, Freebase, …
• each source: up to several million triples
• overall: over 31 billion triples (2012)
Linked Data
Linked Data: set of best practices for publishing and connecting
structured data on the Web using URIs and RDF
Basic idea: apply the general architecture of the World Wide Web
to the task of sharing structured data on a global scale
• The Web is built on the idea of setting hyperlinks between
documents that may reside on different Web servers.
• It is built on a small set of simple standards:
• Global identification mechanism: URIs, IRIs
• Universal access mechanism: HTTP
• Standardized content format: HTML
Linked Data builds directly on the Web architecture and applies this
architecture to the task of sharing data on a global scale
Linked data principles
1. Use URIs as names for things.
2. Use HTTP URIs, so that people can look up those names
(dereferenceable URIs).
3. When someone looks up a URI, provide useful information,
using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more
things.
Linked Data lifecycle
1. Extraction
2. Storage & Querying
3. Authoring
4. Linking
5. Enrichment
6. Quality Analysis
7. Evolution & Repair
8. Search, Browsing & Exploration
Linked Data lifecycle
Open/Closed Linked Data
Linked data may be open (publicly accessible and reusable) or
closed
The LOD cloud diagram (9/2011)
Linking Open Data cloud diagram, 09/2011 (by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/)
The LOD cloud diagram
[Figure: LOD cloud diagram snapshots from 9/2007, 9/2008, 9/2009, and 9/2010]
Use of RDF vocabularies
• Crucial aspect of Linked Data (and of RDF usage in
general): which URIs represent predicates (links)?
• Recommended practice in LOD: if possible, use
existing RDF vocabularies (and preferably the
most popular ones)
• In this way, a de-facto standard is created: all LOD
sites use the same URI to represent the same
property, and the semantics of such properties is
shared (i.e., known by every application)
• This makes it possible for all applications to really
understand the semantics of links
Popular vocabularies
• Friend-of-a-Friend (FOAF), vocabulary for describing people
• Dublin Core (DC) defines general metadata attributes
• Semantically-Interlinked Online Communities (SIOC),
vocabulary for representing online communities
• Description of a Project (DOAP), vocabulary for describing
projects
• Simple Knowledge Organization System (SKOS), vocabulary
for representing taxonomies and loosely structured knowledge
• Music Ontology provides terms for describing artists, albums
and tracks
• Review Vocabulary, vocabulary for representing reviews
• Creative Commons (CC), vocabulary for describing license
terms
Graph Databases
RDF/SPARQL tools
• Jena = Java framework for handling RDF models
and SPARQL queries (http://jena.sourceforge.net/)
RDF/SPARQL (and graph database) tools
• AllegroGraph
(http://www.franz.com/agraph/allegrograph/) = a
native triple store providing support for SPARQL
queries over RDF datasets. It also provides Prolog query
APIs and offers a built-in reasoner over RDFS++ (i.e.,
RDFS predicates plus some predicates of OWL, the
W3C Web Ontology Language).
RDF/SPARQL (and graph database) tools
• RDF datasets can also be stored in non-native RDF storage systems
• Graph databases (such as AllegroGraph) are the best suited: e.g., the
most widely used graph database today, Neo4j
(http://www.neo4j.org/), provides an RDF/SPARQL module which
relies on native storage of the data as a property
graph
• Other kinds of NoSQL databases can be used to store RDF triples,
and often provide some kind of SPARQL query support (e.g., HBase,
Couchbase, Cassandra).
• In all these cases, however, no RDF-specific indexing mechanisms
are guaranteed.
• For a more in-depth discussion of the use of NoSQL databases for RDF
storage, we refer to Cudré-Mauroux et al. NoSQL Databases for
RDF: An Empirical Evaluation. ISWC 2013.