NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them: in which situations do they work better than a relational database, and how do you choose one over another? This talk gives an overview of the NoSQL landscape and a classification of the different architectural categories, clarifying the base concepts and the terminology, and provides a comparison of the features, strengths and drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
1. Lorenzo Alberton
@lorenzoalberton
NoSQL Databases: Why, what and when
NoSQL Databases Demystified
PHP UK Conference, 25th February 2011
19. A little theory
Fundamental Principles of (Distributed) Databases
http://www.timbarcz.com/blog/PassionInProgrammers.aspx
20. ACID
ATOMICITY: All or nothing.
CONSISTENCY: Any transaction will take the db from one consistent state to another, with no broken constraints (referential integrity).
ISOLATION: Other operations cannot access data that has been modified during a transaction that has not yet completed.
DURABILITY: Ability to recover the committed transaction updates against any kind of system failure (transaction log).
21-25. Isolation Levels, Locking & MVCC
Isolation (noun): property that defines how/when the changes made by one operation become visible to other concurrent operations.
SERIALIZABLE: All transactions occur in a completely isolated fashion, as if they were executed serially.
REPEATABLE READ: Multiple SELECT statements issued in the same transaction will always yield the same result.
READ COMMITTED: A lock is acquired only on the rows currently read/updated.
READ UNCOMMITTED: A transaction can access uncommitted changes made by other transactions.
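A minimal sketch of what the levels mean in practice, assuming a PostgreSQL server with a hypothetical accounts(id, balance) table and the psycopg2 driver (the table name and DSN are made up for illustration):

```python
import psycopg2  # assumed installed; any DB-API driver would work similarly

conn = psycopg2.connect("dbname=test")  # hypothetical DSN
cur = conn.cursor()

# READ COMMITTED (the PostgreSQL default): the two SELECTs below may
# disagree if another transaction commits an UPDATE in between.
cur.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
cur.execute("SELECT balance FROM accounts WHERE id = 1")
first = cur.fetchone()
# ... a concurrent transaction commits an UPDATE on the same row here ...
cur.execute("SELECT balance FROM accounts WHERE id = 1")
second = cur.fetchone()  # may differ from `first`: a non-repeatable read
conn.rollback()

# REPEATABLE READ: both reads see the same snapshot for the whole
# transaction, so the anomaly above cannot happen.
cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
```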
28-32. Multi-Version Concurrency Control
[Diagram: a tree of index pages over data blocks. A write creates a new version of the affected index pages alongside the obsolete ones; the new version is published with an atomic pointer update of the Root, and the superseded pages are marked for compaction. Reads: never blocked.]
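A toy sketch of the copy-on-write idea behind the diagram (not any specific engine): writers copy only the path they touch and publish it by swapping the root pointer, so readers on the old root are never blocked.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    keys: tuple               # immutable: pages are never modified in place
    children: tuple = ()

def insert(old_root: Node, key) -> Node:
    """Return a NEW root; only the path down to the touched leaf is copied."""
    if not old_root.children:                       # leaf: copy it with the new key
        return Node(tuple(sorted(old_root.keys + (key,))))
    idx = sum(1 for k in old_root.keys if key > k)  # pick the child to descend into
    child = insert(old_root.children[idx], key)
    kids = old_root.children[:idx] + (child,) + old_root.children[idx + 1:]
    return Node(old_root.keys, kids)

root = Node((10, 20), (Node((1, 5)), Node((12,)), Node((25,))))
snapshot = root          # a reader pins the version current at its start
root = insert(root, 13)  # the "atomic pointer update": publish the new version
# `snapshot` still sees the old tree; obsolete pages are later compacted.
```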
45-46. Distributed Transactions - 2PC
[Diagram: a Coordinator and its Participants in phase 2, the COMMIT PHASE (completion phase). On success the participants Acknowledge the commit; b) on FAILURE (abort from any participant) the transaction is undone everywhere.]
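A minimal sketch of the coordinator logic, assuming the voting phase shown earlier in the deck; `participants` are hypothetical objects exposing prepare/commit/abort:

```python
def two_phase_commit(participants) -> bool:
    # 1) VOTING PHASE: ask every participant to prepare and vote.
    try:
        votes = [p.prepare() for p in participants]
    except Exception:
        votes = [False]                 # a crashed participant counts as a "no"
    if all(votes):
        # 2) COMMIT PHASE (completion): everyone voted yes -> commit,
        #    then collect the acknowledgements.
        for p in participants:
            p.commit()
        return True
    # b) FAILURE (abort from any): undo the transaction everywhere.
    for p in participants:
        p.abort()
    return False
```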
47. Problems with 2PC
Blocking protocol: risk of indefinite cohort blocks if the coordinator fails.
Conservative behaviour: biased to the abort case.
48. Paxos Algorithm (Consensus)
Family of fault-tolerant, distributed implementations.
Spectrum of trade-offs:
Number of processors
Number of message delays
Activity level of participants
Number of messages sent
Types of failures
http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/
http://en.wikipedia.org/wiki/Paxos_algorithm
51. ACID & Distributed Systems
ACID properties are always desirable.
But what about Latency, Partition Tolerance and High Availability?
52-53. CAP Theorem (Brewer's conjecture)
2000: Prof. Eric Brewer, PoDC Conference Keynote
2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)
Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time.
http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
54-57. Partition Tolerance - Availability
"The network will be allowed to lose arbitrarily many messages sent from one node to another" [...]
"For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response" - Gilbert and Lynch, SIGACT 2002
CP: requests can complete at nodes that have quorum.
AP: requests can complete at any live node, possibly violating strong consistency.
HIGH LATENCY ≈ NETWORK PARTITION
http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
http://codahale.com/you-cant-sacrifice-partition-tolerance
http://pl.atyp.us/wordpress/?p=2521
58-59. Consistency: Client-side view
A service that is consistent operates fully or not at all.
Strong consistency (as in ACID)
Weak consistency (no guarantee) - inconsistency window
Eventual* consistency (e.g. DNS)
Causal consistency
Read-your-writes consistency (the least surprise)
Session consistency
Monotonic read consistency
Monotonic write consistency
(*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they're resolved at the earliest opportunity.
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
60. Consistency: Server-side (Quorum)
N = number of nodes with a replica of the data
W = number of replicas that must acknowledge the update (*)
R = minimum number of replicas that must participate in a successful read operation
(*) but the data will be written to N nodes no matter what
W + R > N: Strong consistency (usually N=3, W=R=2)
W = N, R = 1: Optimised for reads
W = 1, R = N: Optimised for writes (durability not guaranteed in presence of failures)
W + R <= N: Weak consistency
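A quick check of the quorum arithmetic above; N, W and R are the values from the slide, not tied to any particular store:

```python
def consistency(n: int, w: int, r: int) -> str:
    # a read overlaps every write iff the two quorums must intersect
    return "strong (W + R > N)" if w + r > n else "weak (W + R <= N)"

print(consistency(3, 2, 2))  # strong: the usual N=3, W=R=2
print(consistency(3, 3, 1))  # strong, optimised for reads  (W=N, R=1)
print(consistency(3, 1, 3))  # strong, optimised for writes (W=1, R=N)
print(consistency(3, 1, 1))  # weak: a read may miss the latest write
```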
61. Amazon Dynamo Paper
Consistent Hashing
Vector Clocks
Gossip Protocol
Hinted Handoffs
Read Repair
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
66. Modulo-based Hashing
[Diagram: keys distributed across servers N1 N2 N3 N4]
partition = key % n_servers (partition index in 0 … n_servers - 1)
Recalculate the hashes for all the entries if n_servers changes (i.e. full data redistribution when adding/removing a node).
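A small demo of why modulo placement is brittle: count how many keys change server when n_servers goes from 4 to 5 (the key set is illustrative only):

```python
keys = range(10_000)
# a key stays put only if key % 4 == key % 5
moved = sum(1 for k in keys if k % 4 != k % 5)
print(f"{moved / 10_000:.0%} of keys relocate")  # ~80% with these numbers
```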
67-72. Consistent Hashing
[Diagram: a ring (key space 0 … 2^160) with nodes A-F placed on it.]
Same hash function for data and nodes: idx = hash(key).
Coordinator: the next available clockwise node. Node B is the canonical home (coordinator node) for key range A-B.
If node B leaves the ring, node C becomes the canonical home for key range A-C: only the keys in this range change location.
http://en.wikipedia.org/wiki/Consistent_hashing
73-74. Consistent Hashing - Replication
[Diagram: the same ring. Key_AB is hosted in B, C, D; node C hosts Key_FA, Key_AB and Key_BC.]
Data is replicated in the N-1 clockwise successor nodes.
http://horicky.blogspot.com/2009/11/nosql-patterns.html
76. Consistent Hashing - Node Changes
Key membership and replicas are updated when a node joins or leaves the network. The number of replicas for all data is kept consistent.
[Diagram: neighbouring nodes copy key ranges AB, FA and EF to restore the replication factor.]
77-78. Consistent Hashing - Load Distribution
Different strategies:
Virtual Nodes: random tokens per each physical node, partition by token value.
Node 1: tokens A, E, G
Node 2: tokens C, F, H
Node 3: tokens B, D, I
Q equal-sized partitions, S nodes, Q/S tokens per node (with Q >> S).
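A sketch of a consistent-hash ring with virtual nodes, following the scheme above (same hash function for data and nodes, next-clockwise coordinator). It uses md5 instead of the 160-bit space on the slide; node names and the vnode count are made up:

```python
import hashlib
from bisect import bisect

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # each physical node owns `vnodes` pseudo-random tokens on the ring
        self.tokens = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

    def coordinator(self, key: str) -> str:
        # next available clockwise node (wrapping past the end of the ring)
        idx = bisect(self.tokens, (h(key),)) % len(self.tokens)
        return self.tokens[idx][1]

ring = Ring(["A", "B", "C"])
print(ring.coordinator("user:42"))  # canonical home for this key
```

Adding or removing a node only moves the keys adjacent to its tokens; the virtual nodes spread that movement evenly across the survivors.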
79-83. Vector Clocks & Conflict Detection
Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2.
[Diagram, nodes A, B, C:]
write handled by A → D1 ([A, 1])
write handled by A → D2 ([A, 2])
write handled by B → D3 ([A, 2], [B, 1]); write handled by C → D4 ([A, 2], [C, 1])
conflict detected; reconciliation handled by A → D5 ([A, 3], [B, 1], [C, 1])
http://en.wikipedia.org/wiki/Vector_clock
http://pl.atyp.us/wordpress/?p=2601
84-87. Vector Clocks & Conflict Detection
Vector clocks can detect a conflict. The conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!)
[Diagram, nodes A, B, C:]
write handled by A → D1 ([A, 1])
write handled by A → D2 ([A, 2])
write handled by B → D3 ([A, 2], [B, 1]); un-modified replica → D4 ([A, 2])
version mismatch detected; D3 ⊇ D4, conflict resolved automatically → D5 ([A, 3], [B, 1])
http://en.wikipedia.org/wiki/Vector_clock
http://pl.atyp.us/wordpress/?p=2601
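A sketch of the comparison and merge rules described above; clocks are plain dicts mapping node id to update counter (Python 3.9+ for the `|` dict union):

```python
def precedes(v1: dict, v2: dict) -> bool:
    """V1 precedes V2 if every counter in V1 is <= the matching one in V2."""
    return v1 != v2 and all(v1[n] <= v2.get(n, 0) for n in v1)

def conflict(v1: dict, v2: dict) -> bool:
    # neither precedes the other: concurrent updates
    return v1 != v2 and not precedes(v1, v2) and not precedes(v2, v1)

d3 = {"A": 2, "B": 1}          # write handled by B
d4 = {"A": 2, "C": 1}          # write handled by C
print(conflict(d3, d4))        # True: reconciliation is left to the app

# reconciliation handled by A: merge the counters and bump A's own
d5 = {n: max(d3.get(n, 0), d4.get(n, 0)) for n in d3 | d4}
d5["A"] += 1
print(d5)                      # {'A': 3, 'B': 1, 'C': 1}
```

In the second scenario on the slide, D4 = {"A": 2} is dominated by D3 (precedes returns True), so the mismatch is resolved automatically.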
88-90. Gossip Protocol + Hinted Handoff
Periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers.
[Diagram - gossip exchange between two nodes that cannot reach B:]
"I can't see B, it might be down but I need some ACK. My Merkle Tree root for range XY is 'ab031dab4a385afda'"
"I can't see B either. My Merkle Tree root for range XY is different!"
"B must be down then. Let's disable it."
[Hinted handoff:]
"My canonical node is supposed to be B."
"I see. Well, I'll take care of it for now, and let B know when B is available again."
91. Merkle Trees (Hash Trees)
Leaves: hashes of data blocks. Nodes: hashes of their children.
Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data.
[Diagram: ROOT = hash(A, B); A = hash(C, D), B = hash(E, F); C = hash(001), D = hash(002), E = hash(003), F = hash(004), over Data Blocks 001-004.]
http://en.wikipedia.org/wiki/Hash_tree
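A sketch of the anti-entropy comparison: build a hash tree over fixed data blocks (an even number, for simplicity) and compare replicas level by level, descending only where hashes differ:

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha1(b).digest()

def merkle(blocks):
    level = [h(b) for b in blocks]            # leaves: hashes of data blocks
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1])   # nodes: hashes of their children
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                               # tree[-1][0] is the root

a = merkle([b"001", b"002", b"003", b"004"])
b = merkle([b"001", b"002", b"XXX", b"004"])
print(a[-1][0] == b[-1][0])                   # False: the replicas diverge
# descending from the differing root narrows the mismatch to one leaf,
# so only that block needs to be transferred between replicas
diff = [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]
print(diff)                                   # [2]
```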
103-104. Voldemort (AP)
Dynamo DHT implementation: consistent hashing, vector clocks.
Simple optimistic locking for multi-row updates; pluggable storage engine.
LICENSE: Apache 2
LANGUAGE: Java
API/PROTOCOL: HTTP, Java, Thrift, Avro, Protobuf
PERSISTENCE: Pluggable (BDB/MySQL)
CONCURRENCY: MVCC
105-107. Membase (CP)
DHT (K-V), no SPoF.
membase: persistence, replication (fail-over HA), rebalancing.
memcached: distributed, in-memory.
"VBuckets": unit of consistency and replication; owner of a subset of the cluster key space; hash function + table lookup.
All metadata kept in memory (high throughput / low latency).
Manual/Programmatic failover via the Management REST API.
LICENSE: Apache 2
LANGUAGE: C/C++, Erlang
API/PROTOCOL: REST/JSON, memcached
http://dustin.github.com/2010/06/29/memcached-vbuckets.html
109. Redis (CP)
K-V store, a "Data Structures Server": Map, Set, Sorted Set, Linked List.
Set/Queue operations, Counters, Pub-Sub, Volatile keys.
10-100K op/s (whole dataset in RAM + VM).
Persistence via snapshotting (tunable fsync freq.).
Distributed if client supports consistent hashing.
LICENSE: BSD
LANGUAGE: ANSI C
API + PROTOCOL: Telnet-like
PERSISTENCE: in memory, bg snapshots
REPLICATION: master-slave
http://redis.io/presentation/Redis_Cluster.pdf
110. 2) Column Families
Google BigTable paper. Data model: big table, column families.
111-118. Google BigTable Paper
Sparse, distributed, persistent multi-dimensional sorted map, indexed by (row_key, column_key, timestamp). Columns are named CF:col_name, where CF is the column family.
[Example row, row_key "com.cnn.www":]
"contents:html": <html>... @ t3, t5, t6
"anchor:cnnsi.com": "CNN" @ t9
"anchor:my.look.ca": "CNN.com" @ t8
Atomic updates. Automatic GC. ACL per column family.
http://labs.google.com/papers/bigtable-osdi06.pdf
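A toy rendering of that map, not Bigtable's API, just the (row_key, column_key, timestamp) → value shape of the data model with the slide's example row:

```python
table = {
    ("com.cnn.www", "contents:html", 6): "<html>...",
    ("com.cnn.www", "contents:html", 5): "<html>...",
    ("com.cnn.www", "contents:html", 3): "<html>...",
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
    ("com.cnn.www", "anchor:my.look.ca", 8): "CNN.com",
}

def latest(row: str, column: str):
    """Most recent version of a cell (highest timestamp wins)."""
    versions = {ts: v for (r, c, ts), v in table.items()
                if r == row and c == column}
    return versions[max(versions)] if versions else None

print(latest("com.cnn.www", "anchor:cnnsi.com"))  # CNN
```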
119-121. Google BigTable: Data Structure
SSTable: the smallest building block; a persistent, immutable Map[k,v]. Operations: lookup by key / key range scan.
Tablet: a dynamically partitioned range of rows, built from multiple SSTables; the unit of distribution and load balancing.
Table: multiple Tablets (table segments) make up a table.
[Diagram: Table → Tablet (range Aaa → Bar) → SSTables, each made of 64KB blocks plus a lookup index.]
123-125. Google BigTable: I/O
[Diagram: a write is appended to the tablet log and applied to the in-memory memtable; a read merges the memtable with the SSTables on GFS. Minor compaction flushes the memtable to a new SSTable; merging / major compaction (GC) folds SSTables together. SSTables use BMDiff and Zippy compression.]
126. Google BigTable: Location Dereferencing
Chubby: replicated, persisted lock service; maintains tablet server locations; 5 replicas, one elected master (via quorum); the Paxos algorithm is used to keep consistency.
Root Tablet: root of the metadata tree.
Up to 3 levels in the metadata hierarchy: Chubby master file → Root Tablet → Metadata Tablets → User Tables.
127. Google BigTable: Architecture
BigTable client → BigTable master: metadata operations.
BigTable client → Tablet Servers: data R/W operations.
Master ↔ Tablet Servers: heartbeat messages, GC, chunk migration.
Master: fs metadata, ACL, GC, load balancing.
Chubby: tracks the master lock and the log of live servers.
128-134. HBase (CP)
OSS implementation of BigTable.
ZooKeeper as coordinator (instead of Chubby).
Support for multiple masters.
Data sorted by key but evenly distributed across the cluster.
LICENSE: Apache 2
LANGUAGE: Java
API/PROTOCOL: REST HTTP, Thrift
PERSISTENCE: memtable/SSTable
135-138. Hypertable (CP)
OSS BigTable implementation. Faster than HBase (10-30K/s).
Hyperspace (Paxos) used instead of ZooKeeper.
Dynamically adapts to changes in workload.
HQL (~SQL).
LICENSE: GPLv2
LANGUAGE: C++
API/PROTOCOL: C++, Thrift
PERSISTENCE: memtable/SSTable
CONCURRENCY: MVCC
139-142. Cassandra (AP)
Data model of BigTable, infrastructure of Dynamo.
Column: col_name, col_value, timestamp.
Super Column: a super_column_name grouping multiple columns.
Column Family: row_key → columns.
Super Column Family: row_key → super columns → columns.
keyspace.get("column_family", key, ["super_column",] "column")
LICENSE: Apache 2
LANGUAGE: Java
PROTOCOL: Thrift, Avro
PERSISTENCE: memtable/SSTable
CONSISTENCY: Tunable R/W/N
http://www.javageneration.com/?p=70
@cassandralondon http://www.meetup.com/Cassandra-London/
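A plain-dict sketch of that nesting (keyspace → column family → row → [super column →] column), mirroring the get() call on the slide without assuming any specific client library; the "Sites"/"anchors" names are made up:

```python
keyspace = {
    "Sites": {                                  # super column family
        "com.cnn.www": {                        # row_key
            "anchors": {                        # super_column_name
                "cnnsi.com":  ("CNN",     9),   # col_name: (col_value, timestamp)
                "my.look.ca": ("CNN.com", 8),
            },
        },
    },
}

def get(column_family, key, super_column, column):
    return keyspace[column_family][key][super_column][column]

print(get("Sites", "com.cnn.www", "anchors", "cnnsi.com"))  # ('CNN', 9)
```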