BigData Tutorial Part3
Storage Layouts
Storage Model
Hierarchical Layout
Flat Layout
Evaluation
Conclusion
Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data
MOTIVATION
[Figure: landscape of RDF systems, arranged by data scale (MB, GB, TB; single machine vs. distributed) and by algorithmic complexity (index lookups, SPARQL, reasoning), with runtime and batch (Map-Reduce) processing modes. Systems shown: CumulusRDF (Apache Cassandra), BigData, 4Store, YARS2, Jena TDB, Sesame, CloudSPARQL, Redland, Jena Mem, SAOR, OWLIM, Pellet, HermiT]
Other options
Only triples with the given URI as subject
Concise Bounded Descriptions
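[Figure: Linked Data lookup between a user agent and a server. The user agent issues a GET for the URI http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5#artist; the server returns RDF from http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5.rdf]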
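The two response options can be sketched as follows. This is a minimal illustration, not CumulusRDF code: the triples, the `ex:` names and the `_:` blank-node convention are assumptions, and the Concise Bounded Description is simplified (it follows blank-node objects but ignores reification).

```python
from urllib.parse import urldefrag

# Toy dataset of (subject, predicate, object) triples; all terms are made up.
TRIPLES = [
    ("ex:artist", "foaf:name", '"Some Artist"'),
    ("ex:artist", "ex:member", "_:b0"),
    ("_:b0", "foaf:name", '"A Member"'),
    ("ex:other", "foaf:knows", "ex:artist"),
]

def subject_triples(triples, uri):
    """Option 1: only triples with the given URI as subject."""
    return [t for t in triples if t[0] == uri]

def concise_bounded_description(triples, uri):
    """Option 2 (simplified): subject triples plus, recursively, the triples
    describing any blank node that appears as an object."""
    result, seen, todo = [], set(), [uri]
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for t in subject_triples(triples, node):
            result.append(t)
            if t[2].startswith("_:"):  # assumed blank-node marker
                todo.append(t[2])
    return result

def document_url(uri):
    """Drop the fragment (#artist), which is not sent to the server in a GET."""
    return urldefrag(uri).url
```

Here `subject_triples(TRIPLES, "ex:artist")` returns the two triples about the artist, while the bounded description additionally pulls in the triple describing the blank node `_:b0`.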
Triple Patterns
A triple pattern is an RDF triple that may contain variables
instead of RDF terms in any position
?s dbpprop:birthPlace dbpedia:Karlsruhe .
or
?s foaf:name ?o .
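A minimal sketch of matching a triple pattern against a single triple, assuming terms are plain strings and variables are the terms that start with `?`. The helper names are illustrative only.

```python
def is_variable(term):
    """Variables are written with a leading question mark, e.g. ?s."""
    return term.startswith("?")

def matches(pattern, triple):
    """Return the variable bindings if the triple matches the pattern, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # A variable used twice must bind to the same term both times.
            if bindings.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return bindings

pattern = ("?s", "dbpprop:birthPlace", "dbpedia:Karlsruhe")
triple = ("dbpedia:Some_Person", "dbpprop:birthPlace", "dbpedia:Karlsruhe")
result = matches(pattern, triple)
```

A pattern without variables yields an empty dictionary on success, so callers should test the result with `is not None` rather than for truthiness.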
Pattern   Index
???       Any
s??       SPO
?p?       POS
??o       OSP
sp?       SPO
?po       POS
s?o       OSP
spo       Any
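The pattern-to-index mapping can be written as a lookup on the shape of the pattern. The sketch below assumes variables start with `?`; the function names are illustrative.

```python
# Pattern shape -> index, as in the table above ("any" means every index works).
INDEX_FOR_PATTERN = {
    "???": "any", "s??": "SPO", "?p?": "POS", "??o": "OSP",
    "sp?": "SPO", "?po": "POS", "s?o": "OSP", "spo": "any",
}

def pattern_shape(pattern):
    """Replace bound terms by their position letter and variables by '?'."""
    return "".join(
        "?" if term.startswith("?") else position
        for term, position in zip(pattern, "spo")
    )

def choose_index(pattern):
    """Pick the index whose key order starts with the bound positions."""
    return INDEX_FOR_PATTERN[pattern_shape(pattern)]
```

For example, `choose_index(("?s", "dbpprop:birthPlace", "dbpedia:Karlsruhe"))` has shape `?po` and selects POS. In each case the bound terms form a prefix of the chosen index order, which is why three indices cover all eight shapes.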
Apache Cassandra
Open source data management system
Distributed key-value store (DHT-based)
Nested key-value data model
Schema-less
Decentralized
Every node in the cluster has the same role
No single point of failure
Elastic
Throughput increases linearly as machines are added with no downtime
Fault-tolerant
Data can be replicated
CumulusRDF
CumulusRDF Functionality
Distributed deployment to enable scale (more data and also more clients) by adding more machines (via Cassandra)
Geographical replication (via Cassandra)
Write-optimised indices with eventual consistency (via Cassandra)
Triple pattern lookups (via CumulusRDF index structures)
Linked Data Lookups (via CumulusRDF index structures)
STORAGE LAYOUTS
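[Figure: Cassandra data model]
Row r0: column key c00 -> column value v00, c01 -> v01, ...
Row r1: c10 -> v10, c11 -> v11, ...
Row r2: super column key sc00 -> { c000 -> v000, ... }, sc01 -> { c010 -> v010, ... }
Row r3: super column key sc00 -> { c000 -> v000, ... }, sc01 -> { c010 -> v010, ... }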
Cassandra limitations
Entire rows always stored on a single node
No range queries on row keys
Columns are stored in specified order and allow for range queries
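These constraints can be illustrated with a toy in-memory model. It is a sketch under stated assumptions, not Cassandra's implementation: `ToyStore` and `ToyRow` are hypothetical names, and hashing the row key modulo the node count stands in for the real partitioner.

```python
import bisect
import hashlib

class ToyRow:
    """One row: column keys kept in sorted order, each mapped to a value."""
    def __init__(self):
        self.keys = []
        self.values = {}

    def put(self, column, value=""):
        if column not in self.values:
            bisect.insort(self.keys, column)
        self.values[column] = value

    def column_range(self, start, end):
        """Columns with start <= key < end, located by binary search."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, end)
        return [(k, self.values[k]) for k in self.keys[lo:hi]]

class ToyStore:
    """Rows are placed by hashing the row key, so a row lives on one node
    and neighbouring row keys end up on unrelated nodes."""
    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def node_for(self, row_key):
        digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def row(self, row_key):
        return self.nodes[self.node_for(row_key)].setdefault(row_key, ToyRow())

store = ToyStore(num_nodes=4)
r0 = store.row("r0")
r0.put("c01", "v01")
r0.put("c00", "v00")
r0.put("c10", "v10")
selected = r0.column_range("c0", "c1")  # columns whose key starts with "c0"
```

Because placement depends only on the hash of the row key, the model has no efficient way to scan a range of row keys, whereas a range over the sorted column keys of one row is a pair of binary searches.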
Hierarchical Layout
Uses super columns
RDF terms occupy row, super column and column positions
Value is empty
[Figure: hierarchical layout example with the labels Row key, Super column key (rdf:type), Column key (Jaws, dbp:Film, dbp:Work) and Value]
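A minimal sketch of the hierarchical layout for an SPO index, using nested dictionaries in place of rows, super columns and columns. The function names and the example triples are illustrative assumptions.

```python
from collections import defaultdict

def new_hier_index():
    # row key (s) -> super column key (p) -> column key (o) -> empty value
    return defaultdict(lambda: defaultdict(dict))

def insert_spo(index, s, p, o):
    index[s][p][o] = ""  # the value is left empty; the keys carry the triple

def lookup(index, s, p=None):
    """Answer s?? (p is None) or sp? from the SPO index."""
    if s not in index:
        return []
    row = index[s]
    predicates = [p] if p is not None else sorted(row)
    return [(s, q, o) for q in predicates for o in sorted(row.get(q, {}))]

spo = new_hier_index()
insert_spo(spo, "dbp:Jaws", "foaf:name", "Jaws")
insert_spo(spo, "dbp:Jaws", "rdf:type", "dbp:Film")
insert_spo(spo, "dbp:Jaws", "rdf:type", "dbp:Work")
types = lookup(spo, "dbp:Jaws", "rdf:type")
```

An sp? lookup reads one super column of one row; an s?? lookup reads the whole row.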
Flat Layout
Uses columns only
Range queries on column keys allow prefix lookups
[Figure: flat layout example. Row key: dbp:Jaws; column keys: "foaf:name Jaws", "rdf:type dbp:Film", "rdf:type dbp:Work"; values are empty]
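A prefix lookup over concatenated column keys can be sketched with a sorted list and binary search. The separator character and the `FlatRow` class are assumptions made for illustration; the source does not state how CumulusRDF encodes the concatenation.

```python
import bisect

SEP = "\x00"      # assumed separator between predicate and object
AFTER_SEP = "\x01"  # the character immediately after SEP, used as range end

class FlatRow:
    """Columns of one subject row; each column key is predicate + SEP + object."""
    def __init__(self):
        self.columns = []  # kept sorted

    def put(self, p, o):
        key = p + SEP + o
        i = bisect.bisect_left(self.columns, key)
        if i == len(self.columns) or self.columns[i] != key:
            self.columns.insert(i, key)

    def prefix(self, p):
        """All (p, o) pairs for predicate p, via a range query on column keys."""
        lo = bisect.bisect_left(self.columns, p + SEP)
        hi = bisect.bisect_left(self.columns, p + AFTER_SEP)
        return [tuple(key.split(SEP, 1)) for key in self.columns[lo:hi]]

jaws = FlatRow()
jaws.put("foaf:name", "Jaws")
jaws.put("rdf:type", "dbp:Work")
jaws.put("rdf:type", "dbp:Film")
types = jaws.prefix("rdf:type")
```

The range from `p + SEP` up to, but excluding, `p + AFTER_SEP` contains exactly the column keys that begin with the predicate followed by the separator, so an sp? pattern becomes a single column range query on the row for s.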
POS Index
RDF data is skewed: many triples may share the same predicate
(rdf:type is a prime example)
p as row key will result in a very uneven distribution
Cassandra cannot split rows among several nodes
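The effect of the skew can be made concrete with a small simulation. The sketch assumes a hash-modulo placement of row keys on four nodes (a stand-in, not Cassandra's partitioner) and synthetic triples; it compares p alone as row key with the concatenation of p and o as row key.

```python
import hashlib
from collections import Counter

NUM_NODES = 4

def node_for(row_key):
    """Assumed placement: hash of the row key modulo the number of nodes."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

def load_per_node(triples, row_key_of):
    """Count how many triples each node stores for a given row key choice."""
    load = Counter()
    for s, p, o in triples:
        load[node_for(row_key_of(s, p, o))] += 1
    return load

# Synthetic, skewed data: many rdf:type triples, few foaf:name triples.
triples = [(f"ex:s{i}", "rdf:type", f"ex:Class{i}") for i in range(1000)]
triples += [(f"ex:s{i}", "foaf:name", f'"name {i}"') for i in range(10)]

by_p = load_per_node(triples, lambda s, p, o: p)
by_po = load_per_node(triples, lambda s, p, o: p + " " + o)
```

With p as row key, every rdf:type triple lands in the same row and therefore on the same node. With p and o combined as row key, the triples are spread over many rows; finding all rows for a given p then needs an additional lookup structure, such as a secondary index.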
EVALUATION
Evaluation
[Figure: evaluation setup showing clients, CumulusRDF and four Cassandra nodes (C0-C3); the figure also carries the label "all"]
Index      Node 1   Node 2   Node 3   Node 4   Std. Dev.   Max. Row
SPO Hier   4.41     4.40     4.41     4.41     0.01        0.0002
SPO Flat   4.36     4.36     4.36     4.36     0.00        0.0004
OSP Hier   5.86     6.00     5.75     6.96     0.56        1.16
OSP Flat   5.66     5.77     5.54     6.61     0.49        0.96
POS Hier   4.43     3.68     4.69     1.08     1.65        2.40
POS Sec    7.35     7.43     7.38     8.05     0.33        0.56
Values in GB
Hier (SPO, OSP, POS), shown for SPO: { s : { p : { o : - } } }
Flat (SPO, OSP), shown for SPO: { s : { po : - } }
Sec (POS): { po : { p : p } }
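The Std. Dev. column can be recomputed from the per-node sizes. The sketch below assumes it is the sample standard deviation (n - 1 in the denominator), which appears to agree with the reported figures to within rounding; the dictionary names are illustrative.

```python
import statistics

# Per-node index sizes in GB, copied from the evaluation table.
SIZES_GB = {
    "SPO Hier": [4.41, 4.40, 4.41, 4.41],
    "SPO Flat": [4.36, 4.36, 4.36, 4.36],
    "OSP Hier": [5.86, 6.00, 5.75, 6.96],
    "OSP Flat": [5.66, 5.77, 5.54, 6.61],
    "POS Hier": [4.43, 3.68, 4.69, 1.08],
    "POS Sec":  [7.35, 7.43, 7.38, 8.05],
}

# Std. Dev. values as reported in the table.
REPORTED_STD_DEV = {
    "SPO Hier": 0.01, "SPO Flat": 0.00, "OSP Hier": 0.56,
    "OSP Flat": 0.49, "POS Hier": 1.65, "POS Sec": 0.33,
}

def spread(sizes):
    """Sample standard deviation of the per-node sizes."""
    return statistics.stdev(sizes)

recomputed = {name: spread(sizes) for name, sizes in SIZES_GB.items()}
most_skewed = max(recomputed, key=recomputed.get)
```

The largest spread belongs to the hierarchical POS index, which is the layout that uses the predicate alone as row key.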
Conclusion
We evaluated two index schemes for RDF on nested key-value stores to support Linked Data lookups
Flat indexing gives the best overall results
Output format impacts performance (N-Triples vs. RDF/XML)