Lorenzo Alberton
@lorenzoalberton

NoSQL Databases:
Why, what and when

NoSQL Databases Demystified

PHP UK Conference, 25th February 2011
NoSQL: Why
Scalability, Concurrency, New trends
New Trends

[Timeline diagram, 2002-2012, with six trends:]

  Big data      Connectivity   P2P Knowledge
  Concurrency   Diversity      Cloud-Grid
What’s the problem with RDBMS’s?

  Caching
  Master/Slave
  Master/Master
  Cluster
  Table Partitioning
  Federated Tables
  Sharding
  Distributed DBs

http://www.codefutures.com/database-sharding
What’s the problem with RDBMS’s?

[Photo] http://www.flickr.com/photos/dimi3/3096166092
Quick Comparison
Overview from 10,000 feet
(random impressions from the interwebs)

http://www.flickr.com/photos/42433826@N00/4914337851
MongoDB is web-scale

...but /dev/null
is even better!
Cassandra is teh schnitz

...but /dev/null
is even better!
CouchDB: Relax!

...but /dev/null is even better!
You have a LOT of free space, right?
No, seriously...*

(*) Not another “Mine is bigger” comparison, please
A little theory
Fundamental Principles of (Distributed) Databases

http://www.timbarcz.com/blog/PassionInProgrammers.aspx
ACID

ATOMICITY: All or nothing

CONSISTENCY: Any transaction takes the db from one
consistent state to another, with no broken constraints
(e.g. referential integrity)

ISOLATION: Other operations cannot access data that has
been modified during a transaction that has not yet completed

DURABILITY: The ability to recover committed transaction
updates after any kind of system failure (transaction log)
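The all-or-nothing behaviour is easy to demonstrate with any ACID-compliant store; here is a minimal sketch with SQLite (the table and the amounts are made up for illustration):

```python
import sqlite3

# Atomicity: either every statement in the transaction is applied, or none is.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired: the whole transfer is rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0} - neither update survived
```

The first UPDATE alone would have succeeded; atomicity guarantees it is undone together with the failing one.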
Isolation Levels, Locking & MVCC
Isolation (noun)
Property that defines how/when the changes made by one operation
become visible to other concurrent operations

  SERIALIZABLE                       REPEATABLE READ
  All transactions occur in a        Multiple SELECT statements
  completely isolated fashion, as    issued in the same transaction
  if they were executed serially     will always yield the same result

  READ COMMITTED                     READ UNCOMMITTED
  A lock is acquired only on the     A transaction can access
  rows currently read/updated        uncommitted changes made
                                     by other transactions
Isolation Levels, Locking & MVCC

Isolation Level     Dirty Reads   Non-repeatable reads   Phantoms
Serializable             -                 -                 -
Repeatable Read          -                 -              possible
Read Committed           -              possible          possible
Read Uncommitted      possible          possible          possible

http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels/
Isolation Levels, Locking & MVCC

Isolation Level     Range Lock   Read Lock   Write Lock
Serializable           held         held        held
Repeatable Read         -           held        held
Read Committed          -            -          held
Read Uncommitted        -            -           -
Multi-Version Concurrency Control

[Diagram: a tree of Root → Index → Data nodes. An update creates a
new version of the modified Data node and of the Index path above
it (copy-on-write); the switch to the new version is a single
atomic pointer update at the root. The obsolete nodes are marked
for compaction. Reads: never blocked.]
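The copy-on-write scheme in the diagram can be sketched in a few lines. This toy store is illustrative only, not any particular database's implementation:

```python
import threading

# MVCC via copy-on-write: writers build a new version of the structure and
# publish it with a single atomic pointer update; readers traverse whichever
# root they grabbed and are never blocked.

class Store:
    def __init__(self):
        self._root = {}                      # current immutable version
        self._write_lock = threading.Lock()  # writers serialise; readers never lock

    def read(self, key):
        root = self._root  # atomic reference grab: a consistent snapshot
        return root.get(key)

    def write(self, key, value):
        with self._write_lock:
            new_root = dict(self._root)  # copy-on-write: the old version stays valid
            new_root[key] = value
            self._root = new_root        # atomic pointer update publishes it
            # the previous version is now obsolete and can be garbage-collected
            # (the "marked for compaction" step in the diagram)

s = Store()
s.write("a", 1)
snapshot = s._root  # a reader holding the old root...
s.write("a", 2)
print(snapshot["a"], s.read("a"))  # 1 2 - the old snapshot is unaffected
```

A reader that started before the write keeps seeing its own consistent version; only new reads observe the new root.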
Distributed Transactions - 2PC

1) COMMIT REQUEST PHASE (voting phase)
   Coordinator → Participants: query to commit
   Participants: 1) Exec transaction up to the COMMIT request
                 2) Write entry to undo and redo logs
   Participants → Coordinator: Agree or Abort

2) COMMIT PHASE (completion phase)
   a) SUCCESS (agreement from all)
      Coordinator → Participants: Commit
      Participants: 1) Complete operation
                    2) Release locks
      Participants → Coordinator: Acknowledge
      Coordinator: complete transaction
   b) FAILURE (abort from any)
      Coordinator → Participants: Rollback
      Participants: 1) Undo operation
                    2) Release locks
      Participants → Coordinator: Acknowledge
      Coordinator: undo transaction
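The two phases above compress into a short sketch; the `prepare`/`commit`/`rollback` names and the `Participant` class are illustrative, not a real protocol library, and no coordinator failures are modelled:

```python
# Two-phase commit: a voting phase followed by a completion phase.

def two_phase_commit(log, participants):
    # Phase 1 - COMMIT REQUEST (voting): each participant executes the
    # transaction up to the COMMIT, writes undo/redo logs, and votes.
    votes = [p.prepare() for p in participants]  # Agree (True) or Abort (False)

    if all(votes):
        # Phase 2a - SUCCESS: all agreed, everyone completes and releases locks.
        for p in participants:
            p.commit()
        log.append("committed")
    else:
        # Phase 2b - FAILURE: a single abort vote forces a global rollback.
        for p in participants:
            p.rollback()
        log.append("rolled back")

class Participant:
    def __init__(self, vote):
        self.vote, self.state = vote, "pending"
    def prepare(self):
        return self.vote
    def commit(self):
        self.state = "committed"
    def rollback(self):
        self.state = "rolled back"

participants = [Participant(True), Participant(False)]
log = []
two_phase_commit(log, participants)
print(log[0])  # rolled back - one abort vote aborts everybody
```

Note how the unanimity requirement is exactly what makes the protocol conservative: a single slow or failed participant drags the whole transaction towards abort.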
Problems with 2PC

Blocking protocol:
  Risk of cohorts blocking indefinitely if the coordinator fails
  Conservative behaviour, biased towards the abort case
Paxos Algorithm (Consensus)
A family of fault-tolerant, distributed consensus protocols,
with a spectrum of trade-offs:
  Number of processors
  Number of message delays
  Activity level of participants
  Number of messages sent
  Types of failures

http://en.wikipedia.org/wiki/Paxos_algorithm
http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/
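As a rough illustration of the core idea, here is a sketch of single-decree Paxos; message passing is simulated with direct calls, no failures or retries are modelled, and the names are made up:

```python
# Single-decree Paxos: a proposer needs a majority of acceptors in both phases.

class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) last accepted, if any

    def prepare(self, n):     # phase 1b
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):  # phase 2b
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    # Phase 1a/1b: gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for verdict, acc in promises if verdict == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None
    # Safety rule: if any acceptor already accepted a value, propose that one.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]  # value of the highest-numbered prior acceptance
    # Phase 2a/2b: ask the majority to accept.
    votes = [a.accept(n, value) for a in acceptors]
    return value if votes.count("accepted") > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="riak")
second = propose(acceptors, n=2, value="mongo")
print(first, second)  # riak riak - once chosen, the value sticks
```

The second proposer is forced to re-propose the already-chosen value: that is the consensus safety property, regardless of which proposer wins.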
ACID & Distributed Systems

ACID properties are always desirable.

But what about:
  Latency,
  Partition Tolerance,
  High Availability?

http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb
CAP Theorem (Brewer’s conjecture)
2000 Prof. Eric Brewer, PoDC Conference Keynote
2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)

Of three properties of shared-data systems - data Consistency,
system Availability and tolerance to network Partitions - only
two can be achieved at any given moment in time.

http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
Partition Tolerance - Availability
“The network will be allowed to lose arbitrarily many messages
sent from one node to another” [...]
“For a distributed system to be continuously available, every
request received by a non-failing node in the system must result
in a response”                    - Gilbert and Lynch, SIGACT 2002

CP: requests can complete only at nodes that have quorum
AP: requests can complete at any live node, possibly violating
strong consistency

HIGH LATENCY ≈ NETWORK PARTITION

http://codahale.com/you-cant-sacrifice-partition-tolerance
http://pl.atyp.us/wordpress/?p=2521
http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
Consistency: Client-side view
A service that is consistent operates fully or not at all.
  Strong consistency (as in ACID)
  Weak consistency (no guarantee) - Inconsistency window
    Eventual* consistency (e.g. DNS)
      Causal consistency
      Read-your-writes consistency (the least surprise)
      Session consistency
      Monotonic read consistency
      Monotonic write consistency

(*) Temporary inconsistencies (e.g. in data constraints or
    replica versions) are accepted, but they’re resolved at the
    earliest opportunity

http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
Consistency: Server-side (Quorum)
N = number of nodes with a replica of the data
W = number of replicas that must acknowledge the update (*)
R = minimum number of replicas that must participate in a
    successful read operation
(*) but the data will be written to N nodes no matter what

W + R > N      Strong consistency (usually N=3, W=R=2)
W = N, R = 1   Optimised for reads
W = 1, R = N   Optimised for writes
               (durability not guaranteed in presence of failures)
W + R <= N     Weak consistency
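The overlap rule is just arithmetic: whenever W + R > N, every read set must intersect every write set in at least one replica. A tiny sketch (the function name is made up):

```python
# Classify an (N, W, R) replica configuration by the quorum-overlap rule.

def consistency(n, w, r):
    if w + r > n:
        return "strong"  # read and write quorums always share >= 1 replica
    return "weak"        # a read may miss the latest acknowledged write

print(consistency(3, 2, 2))  # strong - the usual N=3, W=R=2 setup
print(consistency(3, 3, 1))  # strong - W=N, R=1: optimised for reads
print(consistency(3, 1, 3))  # strong - W=1, R=N: optimised for writes
print(consistency(3, 1, 1))  # weak   - W+R <= N: inconsistency window
```
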
Amazon Dynamo Paper
  Consistent Hashing
  Vector Clocks
  Gossip Protocol
  Hinted Handoffs
  Read Repair

http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
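Of the techniques above, vector clocks fit in a few lines; this is a generic sketch, not Dynamo's actual code:

```python
# Vector clocks: each node increments its own counter on update. Version a
# dominates b only if a is >= b in every component; otherwise the two are
# concurrent - a conflict that read repair or the application must resolve.

def increment(clock, node):
    clock = dict(clock)  # keep versions immutable
    clock[node] = clock.get(node, 0) + 1
    return clock

def dominates(a, b):
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = increment({}, "A")   # {'A': 1}
v2 = increment(v1, "A")   # {'A': 2}          - descends from v1
v3 = increment(v1, "B")   # {'A': 1, 'B': 1}  - a sibling of v2

print(dominates(v2, v1))                       # True: v2 supersedes v1
print(dominates(v2, v3) or dominates(v3, v2))  # False: concurrent versions
```
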
Modulo-based Hashing

  N1      N2      N3      N4

  partition = key % n_servers

When a node is removed: partition = key % (n_servers - 1)

Recalculate the hashes for all the entries if n_servers changes
(i.e. full data redistribution when adding/removing a node)
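The redistribution cost is easy to measure; a sketch comparing key placement for 4 vs 5 servers (plain integer keys stand in for hashed keys):

```python
# Modulo placement: adding one server to a 4-server cluster moves most keys.

def partition(key, n_servers):
    return key % n_servers  # the formula above: partition = key % n_servers

keys = range(10_000)
moved = sum(1 for k in keys if partition(k, 4) != partition(k, 5))
print(f"{moved / len(keys):.0%} of keys change node")  # 80% of keys change node
```

A key stays put only when k % 4 == k % 5, i.e. when k % 20 < 4, so exactly 4 keys in 20 survive: the full-redistribution problem consistent hashing was designed to avoid.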
Consistent Hashing

[Diagram: a ring (key space 0 → 2^160) with nodes A, B, C, D, E, F
placed on it by the same hash function used for the data:
idx = hash(key)]

Coordinator: the next available clockwise node. Node B is the
canonical home (coordinator node) for key range A-B.

If B is removed, C becomes the canonical home for key range A-C:
only the keys in that range change location.

http://en.wikipedia.org/wiki/Consistent_hashing
                       http://en.wikipedia.org/wiki/Consistent_hashing                    30
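The ring lookup described above can be sketched in a few lines of Python (the `Ring` class and `ring_hash` helper are illustrative names, not taken from any particular store):

```python
import hashlib
from bisect import bisect

def ring_hash(value):
    # Same hash function for data and nodes.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node is placed on the ring at the position of its hash.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def coordinator(self, key):
        # Next available node clockwise from the key's position,
        # wrapping around at the top of the key space.
        idx = bisect(self.ring, (ring_hash(key),))
        return self.ring[idx % len(self.ring)][1]

ring = Ring(["A", "B", "C", "D", "E", "F"])
home = ring.coordinator("some-key")
# Dropping one node only relocates the keys that node was coordinating;
# every other key keeps its canonical home.
```

This is the property the slide illustrates: removing a node never changes the coordinator of keys that node did not own.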
Consistent Hashing - Replication

Data replicated in the N-1 clockwise successor nodes.
With N=3, key range AB is hosted in B, C, D;
node C hosts key ranges FA, AB and BC.

http://horicky.blogspot.com/2009/11/nosql-patterns.html
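The resulting "preference list" for a key is its coordinator plus the N-1 clockwise successors. A minimal sketch (helper names are made up; real systems also skip co-located virtual nodes when walking the ring):

```python
import hashlib
from bisect import bisect

def ring_hash(value):
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def preference_list(nodes, key, n=3):
    # Place all nodes on the ring, then walk clockwise from the
    # coordinator, wrapping around, collecting n distinct nodes.
    ring = sorted((ring_hash(node), node) for node in nodes)
    start = bisect(ring, (ring_hash(key),))
    return [ring[(start + i) % len(ring)][1] for i in range(n)]

replicas = preference_list(["A", "B", "C", "D", "E", "F"], "some-key", n=3)
# replicas[0] is the coordinator; the rest are its clockwise successors.
```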
Consistent Hashing - Node Changes

Key membership and replicas are updated when a node
joins or leaves the network. The number of replicas
for all data is kept consistent: the remaining successor
nodes copy the affected key ranges (here AB, FA and EF).
Consistent Hashing - Load Distribution

Different Strategies: Virtual Nodes

Random tokens for each physical node, partition by token value.
  Node 1: tokens A, E, G
  Node 2: tokens C, F, H
  Node 3: tokens B, D, I
Consistent Hashing - Load Distribution

Different Strategies: Virtual Nodes

Q equal-sized partitions, S nodes,
Q/S tokens per node (with Q >> S).
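The virtual-node idea above can be simulated quickly: give each physical node several tokens on the ring and check how evenly keys spread. A hedged sketch (names are illustrative):

```python
import hashlib
from bisect import bisect

def ring_hash(value):
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def build_ring(nodes, tokens_per_node=8):
    # One ring position ("virtual node") per (node, token) pair.
    ring = []
    for node in nodes:
        for t in range(tokens_per_node):
            ring.append((ring_hash(f"{node}:{t}"), node))
    return sorted(ring)

def owner(ring, key):
    idx = bisect(ring, (ring_hash(key),))
    return ring[idx % len(ring)][1]

ring = build_ring(["node1", "node2", "node3"], tokens_per_node=16)
counts = {}
for i in range(3000):
    node = owner(ring, f"key-{i}")
    counts[node] = counts.get(node, 0) + 1
# The more tokens per node, the closer the per-node counts get to 1000 each.
```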
Vector Clocks & Conflict Detection

Causality-based partial order over events that happen in the system.
Document version history: a counter for each node that updated the
document. If every update counter in V1 is smaller than or equal to
the corresponding update counter in V2, then V1 precedes V2.

write handled by A → D1 ([A, 1])
write handled by A → D2 ([A, 2])
write handled by B → D3 ([A, 2], [B, 1])   write handled by C → D4 ([A, 2], [C, 1])
conflict detected → reconciliation handled by A → D5 ([A, 3], [B, 1], [C, 1])

http://en.wikipedia.org/wiki/Vector_clock   http://pl.atyp.us/wordpress/?p=2601
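The D1..D5 history above can be reproduced with a small vector-clock sketch (function names are illustrative):

```python
def increment(clock, node):
    # A write handled by `node` bumps that node's counter.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(v2, v1):
    # v1 precedes v2 if every counter in v1 is <= the matching one in v2.
    return all(v2.get(node, 0) >= count for node, count in v1.items())

def conflict(v1, v2):
    # Neither version descends from the other: concurrent updates.
    return not descends(v1, v2) and not descends(v2, v1)

d1 = increment({}, "A")            # D1 ([A, 1])
d2 = increment(d1, "A")            # D2 ([A, 2])
d3 = increment(d2, "B")            # D3 ([A, 2], [B, 1])
d4 = increment(d2, "C")            # D4 ([A, 2], [C, 1])
assert conflict(d3, d4)            # concurrent writes detected
merged = {"A": 2, "B": 1, "C": 1}  # application-level reconciliation
d5 = increment(merged, "A")        # D5 ([A, 3], [B, 1], [C, 1])
assert descends(d5, d3) and descends(d5, d4)
```

Note the clock itself only detects the conflict; the merge into `merged` is the application's job, as the next slide explains.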
Vector Clocks & Conflict Detection

Vector Clocks can detect a conflict. The conflict resolution is left
to the application or the user. The application might resolve conflicts
by checking relative timestamps, or with other strategies (like merging
the changes). Vector clocks can grow quite large (!)

write handled by A → D1 ([A, 1])
write handled by A → D2 ([A, 2])
write handled by B → D3 ([A, 2], [B, 1])   un-modified replica → D4 ([A, 2])
version mismatch detected → D3 ⊇ D4, conflict resolved automatically → D5 ([A, 3], [B, 1])

http://en.wikipedia.org/wiki/Vector_clock   http://pl.atyp.us/wordpress/?p=2601
Gossip Protocol + Hinted Handoff

Gossip: periodic, pairwise, inter-process interactions of bounded
size among randomly-chosen peers.

F: “I can’t see B, it might be down but I need some ACK.
    My Merkle Tree root for range XY is ‘ab031dab4a385afda’”
E: “I can’t see B either. My Merkle Tree root for range XY is different!”
C: “B must be down then. Let’s disable it.”

Hinted handoff:
A: “My canonical node is supposed to be B.”
D: “I see. Well, I’ll take care of it for now, and let B know
    when B is available again.”
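The hinted-handoff exchange above can be sketched roughly like this (the `Node` class and helper functions are hypothetical, not an actual client API): when a key's canonical node is down, another node accepts the write and keeps a "hint" so it can hand the data back once the peer recovers.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # regular key-value data
        self.hints = []   # (intended_node, key, value) held for down peers

    def write(self, key, value):
        self.store[key] = value

def write_with_handoff(key, value, canonical, fallback):
    if canonical.up:
        canonical.write(key, value)
    else:
        # Fallback stores the data plus a hint about its real owner.
        fallback.write(key, value)
        fallback.hints.append((canonical, key, value))

def replay_hints(node):
    # Called when gossip reports a peer is back up.
    for intended, key, value in list(node.hints):
        if intended.up:
            intended.write(key, value)
            node.hints.remove((intended, key, value))

b, d = Node("B"), Node("D")
b.up = False
write_with_handoff("k", "v", canonical=b, fallback=d)
b.up = True
replay_hints(d)
# b.store now contains {"k": "v"} and d's hint list is empty again.
```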
Merkle Trees (Hash Trees)

Leaves: hashes of data blocks.
Nodes: hashes of their children.

ROOT = hash(A, B)
├─ A = hash(C, D)
│   ├─ C = hash(Data Block 001)
│   └─ D = hash(Data Block 002)
└─ B = hash(E, F)
    ├─ E = hash(Data Block 003)
    └─ F = hash(Data Block 004)

Used to detect inconsistencies between replicas (anti-entropy)
and to minimise the amount of transferred data.

http://en.wikipedia.org/wiki/Hash_tree
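A minimal hash-tree sketch, assuming SHA-1 like in the diagram (helper names are made up): comparing the two roots tells whether any block differs between replicas, without shipping the data itself.

```python
import hashlib

def h(data):
    return hashlib.sha1(data).hexdigest()

def merkle_root(blocks):
    # Leaves hash the data blocks; each level above hashes pairs of
    # children. An odd trailing hash is promoted to the next level.
    level = [h(b) for b in blocks]
    while len(level) > 1:
        nxt = [h((level[i] + level[i + 1]).encode())
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

replica1 = [b"001", b"002", b"003", b"004"]
replica2 = [b"001", b"002", b"XXX", b"004"]
assert merkle_root(replica1) == merkle_root(list(replica1))  # in sync
assert merkle_root(replica1) != merkle_root(replica2)        # divergence found
```

In a real anti-entropy exchange the peers would then walk down the mismatching subtree to find exactly which blocks to transfer.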
Read Repair

GET(k, R=2): two replicas answer with the latest version, k=XYZ (v.2),
while another replica still holds a stale k=ABC (v.1).
The newest version is returned to the client, and an UPDATE(k = XYZ)
is sent to the stale replica to bring it back in sync.
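The read-repair flow above can be sketched as follows (illustrative helpers, not a real client API): read R replicas, return the newest version, and push it back to any replica that answered with a stale copy.

```python
class Replica(dict):
    """Toy replica store: maps key -> (value, version)."""

def get_with_read_repair(replicas, key, r=2):
    answers = [(node, node[key]) for node in replicas[:r]]
    # Highest version wins; versions are plain integers for brevity
    # (a real store would compare vector clocks here).
    winner = max(answers, key=lambda pair: pair[1][1])[1]
    for node, (value, version) in answers:
        if version < winner[1]:
            node[key] = winner   # repair the stale replica in place
    return winner

a, c = Replica(), Replica()
a["k"] = ("XYZ", 2)   # fresh copy
c["k"] = ("ABC", 1)   # stale copy
value, version = get_with_read_repair([a, c], "k", r=2)
# c is repaired to ("XYZ", 2) as a side effect of the read.
```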
NoSQL Break-down
      Key-value stores, Column Families,
 Document-oriented dbs, Graph databases




Focus Of Different Data Models

Size vs. Complexity: Key-Value Stores handle the largest data sizes
with the simplest model; Column Families and Document Databases trade
some scale for richer data; Graph Databases sit at the high-complexity
end of the spectrum.

http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases
1) Key-value stores
                 Amazon Dynamo Paper
 Data model: collection of key-value pairs




Voldemort                                                   AP

Dynamo DHT implementation.
Consistent hashing, Vector clocks.
Conflicts resolved at read and write time.
Data formats: Json, Java String, byte[], Thrift, Avro, ProtoBuf.
Simple optimistic locking for multi-row updates,
pluggable storage engine.

LICENSE:       Apache 2
LANGUAGE:      Java
API/PROTOCOL:  HTTP / Sockets; Java, Thrift, Avro, Protobuf
PERSISTENCE:   Pluggable (BDB/MySQL)
CONCURRENCY:   MVCC
Membase                                                     CP

DHT (K-V), no SPoF.

“VBuckets”: unit of consistency and replication;
owner of a subset of the cluster key space.
Key lookup: hash function + table lookup.

membase: persistence, replication (fail-over HA), rebalancing
memcached: distributed in-memory

LICENSE:       Apache 2
LANGUAGE:      C/C++, Erlang
API/PROTOCOL:  REST/JSON, memcached

http://dustin.github.com/2010/06/29/memcached-vbuckets.html
Membase                                                                                 CP

                                                                                    LICENSE
DHT (K-V), no SPoF
                                                                                   Apache 2
            “VBuckets”                                                              LANGUAGE

                                                                                     C/C++
      membase            memcached                                                  Erlang
                                                                                  API/PROTOCOL
       persistence         distributed
       replication         in-memory                                              REST/JSON
     (fail-over HA)                                                               memcached
       rebalancing

     Unit of consistency and replication
  Owner of a subset of the cluster key space


                                                   hash function + table lookup

All metadata kept in memory (high throughput / low latency).
Manual/Programmatic failover via the Management REST API.
                   http://dustin.github.com/2010/06/29/memcached-vbuckets.html             44
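The "hash function + table lookup" step can be sketched in a few lines of Python (the vBucket count and server names are made up for illustration; real clients fetch the mapping from the cluster):

```python
import zlib

NUM_VBUCKETS = 64  # real clusters typically use 1024
# vBucket -> server table, normally provided by the cluster
vbucket_map = {vb: f"server{vb % 3}" for vb in range(NUM_VBUCKETS)}

def server_for(key: str) -> str:
    vb = zlib.crc32(key.encode()) % NUM_VBUCKETS  # hash function (stable)
    return vbucket_map[vb]                        # table lookup (mutable)

# Rebalancing or failover only rewrites the table, never the hash:
vbucket_map[5] = "server_new"
print(server_for("some_key"))
```

This is why vBuckets give high availability without rehashing: keys always hash to the same vBucket, and only the vBucket-to-server table changes when the cluster does.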
Riak                                                  AP

                                                  LICENSE



                                                 Apache 2
                                                  LANGUAGE

                                                C, Erlang
                                                API/PROTOCOL

                                                REST HTTP
                                                     *
                                                 ProtoBuf




                   Buckets → K-V
                 “Links” (~relations)
              Targeted JS Map/Reduce
       Tune-able consistency (one-quorum-all)
                                                         45
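"Tune-able consistency (one-quorum-all)" means each request picks R (read) and W (write) replica counts out of N; R + W > N guarantees read and write sets overlap. A tiny illustrative check (not Riak code):

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """Read and write quorums overlap in at least one replica."""
    return r + w > n

N = 3
print(is_strongly_consistent(N, r=1, w=1))  # False: fast, eventually consistent
print(is_strongly_consistent(N, r=2, w=2))  # True: quorum reads and writes
print(is_strongly_consistent(N, r=3, w=1))  # True: read-all / write-one
```

Lower R and W buy latency and availability at the cost of possibly stale reads; the application chooses per request.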
Redis                                                                     CP

                                                                      LICENSE
K-V store “Data Structures Server”
                                                                        BSD
Map, Set, Sorted Set, Linked List                                     LANGUAGE
Set/Queue operations, Counters, Pub-Sub, Volatile keys                ANSI C
                                                                        API

                                                                         *
                            +                                         PROTOCOL

                                                                     Telnet-like
                                                                    PERSISTENCE
     10-100K op/s (whole dataset in RAM + VM)
                                                                    in memory
                                                                    bg snapshots
      Persistence via snapshotting (tunable fsync freq.)            REPLICATION

                                                                    master-slave

      Distributed if client supports consistent hashing
                   http://redis.io/presentation/Redis_Cluster.pdf             46
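"Distributed if client supports consistent hashing" means the client, not the server, maps keys onto a hash ring of nodes. A bare-bones ring in Python (node names and replica count are illustrative):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = {}    # point on the ring -> node
        self.points = []  # sorted ring points
        for node in nodes:
            for i in range(replicas):  # virtual nodes smooth the distribution
                h = int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16)
                self.ring[h] = node
                bisect.insort(self.points, h)

    def node_for(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.points, h) % len(self.points)  # walk clockwise
        return self.ring[self.points[idx]]

ring = HashRing(["redis1:6379", "redis2:6379", "redis3:6379"])
print(ring.node_for("user:42"))  # a given key always maps to the same node
```

Adding or removing a node only remaps the keys adjacent to its ring points, instead of reshuffling everything as a plain `hash(key) % n` would.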
2) Column Families
                 Google BigTable paper
   Data model: big table, column families




                                            47
Google BigTable Paper
Sparse, distributed, persistent multi-dimensional sorted map
indexed by (row_key, column_key, timestamp)




 CF:col_name     “contents:html”                    “anchor:cnnsi.com”             “anchor:my.look.ca”



                         <html>...             t3
                        <html>...           t5
 “com.cnn.www”         <html>...          t6              “CNN”               t9       “CNN.com”     t8
                       column                             column                         column
row_key          row                                                 column family

Atomic updates                                      Automatic GC                               ACL
                          http://labs.google.com/papers/bigtable-osdi06.pdf                               48
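The sorted map indexed by (row_key, column_key, timestamp) can be modelled directly if we ignore distribution. A toy Python sketch of the cnn example above (values and timestamps are illustrative):

```python
# Toy in-memory model of BigTable's map:
# (row_key, column_key, timestamp) -> value
table = {}

def put(row, column, value, ts):
    table[(row, column, ts)] = value

def get(row, column):
    """Return the most recent version of a cell."""
    versions = {ts: v for (r, c, ts), v in table.items()
                if r == row and c == column}
    return versions[max(versions)] if versions else None

put("com.cnn.www", "contents:html", "<html>v1...", ts=3)
put("com.cnn.www", "contents:html", "<html>v2...", ts=5)
put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=9)

print(get("com.cnn.www", "contents:html"))  # <html>v2... (newest wins)
```

The real system additionally keeps the map sorted by row key (so rows cluster into tablets) and garbage-collects old versions per column family.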
Google BigTable: Data Structure
SSTable
Smallest building block
Persistent immutable Map[k,v]
Operations: lookup by key / key range scan

Tablet
Dynamically partitioned range of rows
Built from multiple SSTables
Unit of distribution and load balancing

Table
Multiple tablets (table segments) make up a table

Table
Tablet (range Aaa → Bar)

                           SSTable                           SSTable
  64KB    64KB     64KB              64KB    64KB    64KB
  block   block    block   lookup    block   block   block   lookup
                            index                             index

                                                                       49
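An SSTable is essentially a sorted, immutable key-value file supporting point lookups and range scans. A minimal in-memory Python sketch (the 64KB block / index layout is omitted):

```python
import bisect

class SSTable:
    """Immutable sorted map: built once, then only read."""
    def __init__(self, items):
        self._keys = sorted(items)   # sorted keys enable binary search
        self._map = dict(items)

    def lookup(self, key):
        return self._map.get(key)

    def scan(self, start, end):
        """Yield (key, value) pairs for start <= key < end."""
        i = bisect.bisect_left(self._keys, start)
        j = bisect.bisect_left(self._keys, end)
        for k in self._keys[i:j]:
            yield k, self._map[k]

sst = SSTable({"aaa": 1, "abc": 2, "bar": 3})
print(sst.lookup("abc"))           # 2
print(list(sst.scan("aaa", "b")))  # [('aaa', 1), ('abc', 2)]
```

Immutability is the key design choice: SSTables never change after being written, so readers need no locks and compactions simply write new files.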
Google BigTable: I/O

           memtable                    read
                            minor
                          compaction
memory
GFS


           tablet log

                            SSTable SSTable SSTable
   write                                      BMDiff Zippy




                        merging / major compaction (GC)
                                                             50
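The write path above (append to the tablet log, apply to the memtable, flush to an SSTable on minor compaction) can be sketched as follows; this is illustrative Python, with no real GFS or on-disk log:

```python
class TabletWriter:
    def __init__(self, memtable_limit=3):
        self.log = []        # commit log (the tablet log on GFS)
        self.memtable = {}   # in-memory write buffer
        self.sstables = []   # immutable flushed files
        self.limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))   # 1. append to the log (durability)
        self.memtable[key] = value      # 2. apply to the memtable
        if len(self.memtable) >= self.limit:
            self.minor_compaction()     # 3. flush when the memtable fills up

    def minor_compaction(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Reads merge the memtable with the SSTables, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None

t = TabletWriter()
for i in range(4):
    t.write(f"k{i}", i)
print(len(t.sstables), t.read("k0"), t.read("k3"))  # 1 0 3
```

Merging/major compactions (not shown) periodically rewrite many SSTables into one, discarding deleted and superseded entries so reads touch fewer files.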
Google BigTable: Location Dereferencing

                                           Metadata Tablets   User Tables


                                                  ...               ...


                          Root Tablet
    Master File
                                                  ...
      Chubby                     ...                                ...

 Chubby: replicated, persisted lock service;
 maintains tablet server locations.
 5 replicas, one elected master (via quorum);
 Paxos algorithm used to keep consistency.

 Root Tablet: root of the metadata tree.
 Up to 3 levels in the metadata hierarchy.

                                                                            51
Google BigTable: Architecture
                                                               fs metadata, ACL,
                                                              GC, load balancing

                BigTable     metadata operations               BigTable
                 client                                        master
   data R/W                                    heartbeat
   operations                                messages, GC,
                                            chunk migration


  Tablet        Tablet     Tablet                                Chubby
  Server        Server     Server                  track
                                                                 master lock,
                                                              log of live servers



  Tablet        Tablet     Tablet
                                                                                    52
HBase                                                     CP

                                                       LICENSE
OSS implementation of BigTable
                                                      Apache 2
ZooKeeper as coordinator                               LANGUAGE
(instead of Chubby)
                                                        Java
Support for multiple masters                        API/PROTOCOL

                                                    REST HTTP
Data sorted by key, but evenly                        Thrift
distributed across the cluster                      PERSISTENCE

Batch Streaming, Map/Reduce                         memtable/
                                                     SSTable
HDFS, S3, S3N, EBS
 (with GZip/LZO
 CF compression)

                                                              53
Hypertable                              CP

                                    LICENSE
OSS BigTable implementation
Faster than HBase (10-30K/s)        GPLv2
                                    LANGUAGE
Hyperspace (paxos) used
instead of ZooKeeper                  C++
                                  API/PROTOCOL

Dynamically adapts to                 C++
changes in workload                 Thrift
                                  PERSISTENCE

                                  memtable/
                                   SSTable
                                  CONCURRENCY
                                     MVCC

                               HQL (~SQL)
                                            54
Cassandra                                                                                                       AP

                                                                                                             LICENSE
      Data model of BigTable, infrastructure of Dynamo
                                                   Super Column Family                                     Apache 2
                                                                                                            LANGUAGE

                          super_column_name                            super_column_name                      Java
                                                                                                            PROTOCOL
                    col_name                col_name               col_name             col_name            Thrift
row_key
                                      ...                  ...                    ...                        Avro
                    col_value               col_value              col_value            col_value          PERSISTENCE
                     timestamp              timestamp               timestamp            timestamp
                                                                                                           memtable/
                                                                                                            SSTable
                                                                                                           CONSISTENCY
keyspace.get("column_family",
key,
["super_column",]
"column")
                                                                                                            Tunable
                                                                                                             R/W/N





http://www.javageneration.com/?p=70          @cassandralondon    http://www.meetup.com/Cassandra-London/             55
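The row_key → super_column → column nesting above maps naturally onto nested dicts. A toy Python model of the `keyspace.get(...)` call (not the real Thrift API; keyspace contents are invented for illustration):

```python
# column family: row_key -> super_column_name -> col_name -> (value, timestamp)
keyspace = {
    "Users": {
        "lorenzo": {
            "contact": {
                "email": ("l@example.com", 1),
                "twitter": ("@lorenzoalberton", 2),
            },
        },
    },
}

def get(column_family, key, super_column, column):
    """Mimics keyspace.get("column_family", key, "super_column", "column")."""
    value, _ts = keyspace[column_family][key][super_column][column]
    return value

print(get("Users", "lorenzo", "contact", "twitter"))  # @lorenzoalberton
```

The timestamp stored with each column is what lets Cassandra's replicas converge: on conflict, the highest timestamp wins during read repair.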
NoSQL Databases: Why, what and when
  • 1. Lorenzo Alberton @lorenzoalberton NoSQL Databases: Why, what and when NoSQL Databases Demystified PHP UK Conference, 25th February 2011 1
  • 3. New Trends 2002 2004 2006 2008 2010 2012 Big data 3
  • 4. New Trends 2002 2004 2006 2008 2010 2012 Big data Concurrency 3
  • 5. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity Concurrency 3
  • 6. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity Concurrency Diversity 3
  • 7. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity P2P Knowledge Concurrency Diversity 3
  • 8. New Trends 2002 2004 2006 2008 2010 2012 Big data Connectivity P2P Knowledge Concurrency Diversity Cloud-Grid 3
  • 9. What’s the problem with RDBMS’s? http://www.codefutures.com/database-sharding 4
  • 10. What’s the problem with RDBMS’s? Caching Master/Slave Master/Master Cluster Table Partitioning Federated Tables Sharding http://www.codefutures.com/database-sharding Distributed DBs 4
  • 11. What’s the problem with RDBMS’s? 5
  • 12. What’s the problem with RDBMS’s? http://www.flickr.com/photos/dimi3/3096166092 5
  • 13. Quick Comparison Overview from 10,000 feet (random impressions from the interwebs) 6
  • 14. Quick Comparison Overview from 10,000 feet (random impressions from the interwebs) http://www.flickr.com/photos/42433826@N00/4914337851 6
  • 15. MongoDB is web-scale ...but /dev/null is even better! 7
  • 16. Cassandra is teh schnitz ...but /dev/null is even better! Love, /dev/null 8
  • 17. CouchDB: Relax! ...but /dev/null frees a LOT of space! Love, /dev/null 9
  • 18. No, seriously...* (*) Not another “Mine is bigger” comparison, please 10
  • 19. A little theory Fundamental Principles of (Distributed) Databases http://www.timbarcz.com/blog/PassionInProgrammers.aspx 11
  • 20. ACID ATOMICITY: All or nothing CONSISTENCY: Any transaction will take the db from one consistent state to another, with no broken constraints (referential integrity) ISOLATION: Other operations cannot access data that has been modified during a transaction that has not yet completed DURABILITY: Ability to recover the committed transaction updates against any kind of system failure (transaction log) 12
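The Atomicity and Consistency properties above can be demonstrated with any ACID database; a minimal sketch using SQLite (the account names and amounts are made up for illustration): either both UPDATEs of a transfer commit, or the whole transaction rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")   # consistency constraint
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

def transfer(src, dst, amount):
    try:
        with conn:  # transaction: COMMIT on success, ROLLBACK on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except sqlite3.IntegrityError:
        return False  # constraint broken: no partial update survives

transfer(1, 2, 30)      # succeeds
transfer(1, 2, 1000)    # fails: would drive account 1 negative, rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
# balances == {1: 70, 2: 30}: all or nothing
```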
  • 21. Isolation Levels, Locking & MVCC Isolation noun Property that defines how/when the changes made by one operation become visible to other concurrent operations 13
  • 22. Isolation Levels, Locking & MVCC Isolation noun Property that defines how/when the changes made by one operation become visible to other concurrent operations SERIALIZABLE All transactions occur in a completely isolated fashion, as if they were executed serially 13
  • 23. Isolation Levels, Locking & MVCC Isolation noun Property that defines how/when the changes made by one operation become visible to other concurrent operations SERIALIZABLE: All transactions occur in a completely isolated fashion, as if they were executed serially. REPEATABLE READ: Multiple SELECT statements issued in the same transaction will always yield the same result 13
  • 24. Isolation Levels, Locking & MVCC Isolation noun Property that defines how/when the changes made by one operation become visible to other concurrent operations SERIALIZABLE: All transactions occur in a completely isolated fashion, as if they were executed serially. REPEATABLE READ: Multiple SELECT statements issued in the same transaction will always yield the same result. READ COMMITTED: A lock is acquired only on the rows currently read/updated 13
  • 25. Isolation Levels, Locking & MVCC Isolation noun Property that defines how/when the changes made by one operation become visible to other concurrent operations SERIALIZABLE: All transactions occur in a completely isolated fashion, as if they were executed serially. REPEATABLE READ: Multiple SELECT statements issued in the same transaction will always yield the same result. READ COMMITTED: A lock is acquired only on the rows currently read/updated. READ UNCOMMITTED: A transaction can access uncommitted changes made by other transactions 13
  • 26. Isolation Levels, Locking & MVCC. Isolation Level (Dirty Reads / Non-repeatable Reads / Phantoms): Serializable (- / - / -); Repeatable Read (- / - / possible); Read Committed (- / possible / possible); Read Uncommitted (possible / possible / possible) http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels/ 14
  • 27. Isolation Levels, Locking & MVCC. Isolation Level (Range Lock / Read Lock / Write Lock): Serializable (yes / yes / yes); Repeatable Read (- / yes / yes); Read Committed (- / - / yes); Read Uncommitted (- / - / -) 15
  • 28. Multi-Version Concurrency Control [B-tree diagram: Root, Index nodes, Data blocks] 16
  • 29. Multi-Version Concurrency Control: a write creates a new version of the affected nodes copy-on-write; the obsolete version stays visible to in-flight readers 16
  • 30. Multi-Version Concurrency Control: the new version is published with an atomic pointer update at the root 16
  • 31. Multi-Version Concurrency Control: obsolete nodes are marked for compaction 16
  • 32. Multi-Version Concurrency Control: reads are never blocked 16
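The copy-on-write idea behind the MVCC diagrams above can be sketched in a few lines (an illustrative toy, not a real storage engine): a writer builds a new version and publishes it with one atomic root swap, so readers holding the old root are never blocked.

```python
# Minimal copy-on-write sketch of MVCC: writers copy, then publish with a
# single atomic root swap; readers pin whatever version they grabbed.

class Store:
    def __init__(self):
        self.root = {}                # current version (immutable by convention)

    def snapshot(self):
        return self.root              # readers: one pointer read, no locks

    def write(self, key, value):
        new_root = dict(self.root)    # copy-on-write: old version left intact
        new_root[key] = value
        self.root = new_root          # atomic pointer update (single assignment)

db = Store()
db.write("a", 1)
old = db.snapshot()       # a reader pins version 1
db.write("a", 2)          # a writer publishes version 2
assert old["a"] == 1                  # the pinned reader still sees version 1
assert db.snapshot()["a"] == 2        # new readers see version 2
```

The old version becomes garbage once no reader references it, which is what the "marked for compaction" step cleans up.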
  • 33. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Participants 17
  • 34. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Query to commit Participants 17
  • 35. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Participants 1) Exec Transaction up to the COMMIT request 2) Write entry to undo and redo logs 17
  • 36. Distributed Transactions - 2PC Coordinator 1) COMMIT REQUEST PHASE (voting phase) Agree or Abort Participants 17
  • 37. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) a) SUCCESS (agreement from all) Participants 18
  • 38. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Commit a) SUCCESS (agreement from all) Participants 18
  • 39. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) a) SUCCESS (agreement from all) Participants 1) Complete operation 2) Release locks 18
  • 40. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Acknowledge a) SUCCESS (agreement from all) Participants 18
  • 41. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE Complete transaction (completion phase) a) SUCCESS (agreement from all) Participants 18
  • 42. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) b) FAILURE (abort from any) Participants 19
  • 43. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Rollback b) FAILURE (abort from any) Participants 19
  • 44. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) b) FAILURE (abort from any) Participants 1) Undo operation 2) Release locks 19
  • 45. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE (completion phase) Acknowledge b) FAILURE (abort from any) Participants 19
  • 46. Distributed Transactions - 2PC Coordinator 2) COMMIT PHASE Undo transaction (completion phase) b) FAILURE (abort from any) Participants 19
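The two-phase commit flow on the slides above can be sketched as follows (class and method names are invented for illustration): phase 1 collects votes, phase 2 commits only if every participant agreed, and any single abort vetoes the whole transaction.

```python
# Hedged sketch of two-phase commit: voting phase, then completion phase.

class Participant:
    def __init__(self, name, will_agree=True):
        self.name, self.will_agree, self.state = name, will_agree, "init"

    def prepare(self):      # phase 1: exec up to COMMIT, write undo/redo logs
        self.state = "prepared" if self.will_agree else "aborted"
        return self.will_agree      # vote: agree or abort

    def commit(self):       # phase 2a: complete operation, release locks
        self.state = "committed"

    def rollback(self):     # phase 2b: undo operation, release locks
        self.state = "rolled_back"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]     # commit request (voting) phase
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:                          # any abort vetoes the commit
        p.rollback()
    return "rolled_back"

assert two_phase_commit([Participant("a"), Participant("b")]) == "committed"
assert two_phase_commit([Participant("a"),
                         Participant("b", will_agree=False)]) == "rolled_back"
```

The blocking problem discussed next is visible here: if the coordinator dies between the two phases, prepared participants hold their locks indefinitely.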
  • 47. Problems with 2PC. Blocking Protocol: risk of indefinite cohort blocks if the coordinator fails. Conservative behaviour: biased to the abort case 20
  • 48. Paxos Algorithm (Consensus) Family of Fault-tolerant, distributed implementations Spectrum of trade-offs: Number of processors Number of message delays Activity level of participants Number of messages sent Types of failures http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/ http://en.wikipedia.org/wiki/Paxos_algorithm 21
  • 50. ACID & Distributed Systems http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
  • 51. ACID & Distributed Systems ACID properties are always desirable But what about: Latency Partition Tolerance High Availability ? http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb 23
  • 52. CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
  • 53. CAP Theorem (Brewer’s conjecture) 2000 Prof. Eric Brewer, PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 24
  • 54. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
  • 55. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
  • 56. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum AP: requests can complete at any live node, possibly violating strong consistency http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
  • 57. Partition Tolerance - Availability “The network will be allowed to lose arbitrarily many messages sent from one node to another” [...] “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002 CP: requests can complete at nodes that have quorum. AP: requests can complete at any live node, possibly violating strong consistency. HIGH LATENCY ≈ NETWORK PARTITION http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html http://codahale.com/you-cant-sacrifice-partition-tolerance http://pl.atyp.us/wordpress/?p=2521 25
  • 58. Consistency: Client-side view A service that is consistent operates fully or not at all. Strong consistency (as in ACID) Weak consistency (no guarantee) - Inconsistency window (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
  • 59. Consistency: Client-side view A service that is consistent operates fully or not at all. Strong consistency (as in ACID). Weak consistency (no guarantee) - Inconsistency window. Eventual* consistency (e.g. DNS): Causal consistency, Read-your-writes consistency (the least surprise), Session consistency, Monotonic read consistency, Monotonic write consistency. (*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity http://www.allthingsdistributed.com/2008/12/eventually_consistent.html 26
  • 60. Consistency: Server-side (Quorum) N = number of nodes with a replica of the data (*) W = number of replicas that must acknowledge the update R = minimum number of replicas that must participate in a successful read operation (*) but the data will be written to N nodes no matter what W+R>N Strong consistency (usually N=3, W=R=2) W = N, R =1 Optimised for reads W = 1, R = N Optimised for writes (durability not guaranteed in presence of failures) W + R <= N Weak consistency 27
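The quorum arithmetic on the slide above fits in one function: with N replicas, W write acknowledgements and R read participants, the read and write sets are guaranteed to overlap (so a read sees the latest write) exactly when W + R > N.

```python
# Quorum-based consistency check: W + R > N means strong consistency,
# because every read quorum intersects every write quorum.

def consistency(n, w, r):
    return "strong" if w + r > n else "weak"

assert consistency(3, 2, 2) == "strong"   # common default: N=3, W=R=2
assert consistency(3, 3, 1) == "strong"   # W=N, R=1: optimised for reads
assert consistency(3, 1, 3) == "strong"   # W=1, R=N: optimised for writes
assert consistency(3, 1, 1) == "weak"     # W+R <= N: only weak consistency
```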
  • 61. Amazon Dynamo Paper Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoffs Read Repair http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 28
  • 62. Modulo-based Hashing N1 N2 N3 N4 29
  • 63. Modulo-based Hashing N1 N2 N3 N4 ? 29
  • 64. Modulo-based Hashing N1 N2 N3 N4 partition = key % n_servers 29
  • 65. Modulo-based Hashing N1 N2 N3 N4 partition = key % (n_servers - 1) 29
  • 66. Modulo-based Hashing N1 N2 N3 N4 partition = key % (n_servers - 1) Recalculate the hashes for all the entries if n_servers changes (i.e. full data redistribution when adding/removing a node) 29
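The redistribution problem above is easy to measure: with modulo placement, changing the server count changes the partition of almost every key. A small sketch (server counts chosen arbitrarily for illustration):

```python
# Modulo-based placement: count how many keys move when one server is added.

def partition(key, n_servers):
    return key % n_servers

keys = range(10_000)
moved = sum(1 for k in keys if partition(k, 4) != partition(k, 5))
fraction = moved / len(keys)
# going from 4 to 5 servers relocates ~80% of the keys: a near-full reshuffle
assert fraction > 0.75
```

Consistent hashing, covered next, reduces this to roughly 1/n of the keys.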
  • 67. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 68. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 69. Consistent Hashing 2160 0 A canonical home (coordinator node) for key range A-B F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 70. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 71. Consistent Hashing 2160 0 A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C canonical home for key range A-C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 72. Consistent Hashing 2160 0 only the keys in this range change location A F B Ring Same hash function E (key space) for data and nodes idx = hash(key) D Coordinator: next C canonical home for key range A-C available clockwise node http://en.wikipedia.org/wiki/Consistent_hashing 30
  • 73. Consistent Hashing - Replication A F B Ring E (key space) D C http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
  • 74. Consistent Hashing - Replication Key hosted AB A in B, C, D F B Data replicated in Ring the N-1 clockwise E (key space) successor nodes D C Node hosting Key , Key , Key FA AB BC http://horicky.blogspot.com/2009/11/nosql-patterns.html 31
  • 75. Consistent Hashing - Node Changes A F B E D C 32
  • 76. Consistent Hashing - Node Changes Key membership A and replicas are updated when a F B node joins or leaves Copy Key the network. Range AB The number of E Copy Key replicas for all data Range FA is kept consistent. D C Copy Key Range EF 32
  • 77. Consistent Hashing - Load Distribution 2160 0 Different Strategies A I Virtual Nodes H B Random tokens per each Ring physical node, partition by C G (key space) token value D Node 1: tokens A, E, G F Node 2: tokens C, F, H E Node 3: tokens B, D, I 33
  • 78. Consistent Hashing - Load Distribution 2160 0 Different Strategies Virtual Nodes Q equal-sized partitions, Ring S nodes, Q/S tokens per (key space) node (with Q >> S) Node 1 Node 2 Node 3 Node 4 ... 34
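The ring with virtual nodes described above can be sketched as follows (hash function, node names and vnode count are illustrative choices, not anything prescribed by the slides): nodes and keys share one hash space, a key belongs to the next token clockwise, and adding a node only moves the key ranges that node takes over.

```python
import bisect
import hashlib

def h(value):
    # map any string into the ring's hash space
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # each physical node owns several random tokens (virtual nodes)
        self.tokens = sorted((h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))

    def node_for(self, key):
        hashes = [t for t, _ in self.tokens]
        # next token clockwise from the key's position, wrapping at the top
        i = bisect.bisect(hashes, h(key)) % len(self.tokens)
        return self.tokens[i][1]

before = Ring(["A", "B", "C"])
after = Ring(["A", "B", "C", "D"])      # one node joins
keys = [f"key{i}" for i in range(1000)]
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
# only the ranges D takes over move (roughly a quarter), not the whole key space
assert 0 < moved < 500
```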
  • 79. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 80. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 81. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document. D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 82. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document. D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal conflict detected reconciliation handled by A to all update counters in ? V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 83. Vector Clocks & Conflict Detection A B C write handled by A Causality-based partial order over events that D1 ([A, 1]) happen in the system. write handled by A Document version D2 ([A, 2]) history: a counter for each node that updated write handled by B write handled by C the document. D3 ([A, 2], [B, 1]) D4 ([A, 2], [C,1]) If all update counters in V1 are smaller or equal conflict detected reconciliation handled by A to all update counters in D5 ([A, 3], [B, 1], [C,1]) ? V2, then V1 precedes V2. http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 35
  • 84. Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
  • 85. Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
  • 86. Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by write handled by B un-modified replica checking relative timestamps, or with D3 ([A, 2], [B, 1]) D4 ([A, 2]) other strategies (like merging the changes). Vector clocks can grow quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
  • 87. Vector Clocks & Conflict Detection A B C write handled by A Vector Clocks can detect a conflict. The conflict D1 ([A, 1]) resolution is left to the write handled by A application or the user. The application might D2 ([A, 2]) resolve conflicts by write handled by B un-modified replica checking relative timestamps, or with D3 ([A, 2], [B, 1]) D4 ([A, 2]) other strategies (like merging the changes). version mismatch D3 ⊇ D4, conflict detected resolved automatically Vector clocks can grow D5 ([A, 3], [B, 1]) quite large (!) http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601 36
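The comparison rule from the slides above (V1 precedes V2 iff every counter in V1 is less than or equal to the matching counter in V2; if neither dominates, the versions conflict) can be sketched directly, replaying the D2/D3/D4/D5 example:

```python
# Vector clock comparison: detect causal ordering vs. conflict.

def precedes(v1, v2):
    nodes = set(v1) | set(v2)
    return all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)

def relation(v1, v2):
    if precedes(v1, v2):
        return "v1 precedes v2"
    if precedes(v2, v1):
        return "v2 precedes v1"
    return "conflict"   # resolution left to the application or the user

d2 = {"A": 2}
d3 = {"A": 2, "B": 1}   # write handled by B after D2
d4 = {"A": 2, "C": 1}   # concurrent write handled by C
assert relation(d2, d3) == "v1 precedes v2"
assert relation(d3, d4) == "conflict"        # neither dominates
d5 = {"A": 3, "B": 1, "C": 1}                # reconciliation handled by A
assert relation(d3, d5) == "v1 precedes v2"
```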
  • 88. Gossip Protocol + Hinted Handoff A periodic, pairwise, F B inter-process interactions of bounded size E among randomly- chosen peers D C 37
  • 89. Gossip Protocol + Hinted Handoff A I can’t see B, it might be periodic, pairwise, F down but I need some B inter-process ACK. My Merkle Tree root for range XY is interactions of “ab031dab4a385afda” bounded size E among randomly- I can’t see B either. My Merkle Tree root for chosen peers range XY is different! D C B must be down then. Let’s disable it. 37
  • 90. Gossip Protocol + Hinted Handoff My canonical node is supposed to be B. A periodic, pairwise, F B inter-process interactions of bounded size E among randomly- chosen peers D I see. Well, I’ll take care of it for now, and let B know C when B is available again 37
  • 91. Merkle Trees (Hash Trees) Leaves: hashes of ROOT hash(A, B) data blocks. Nodes: hashes of their children. A B hash(C, D) hash(E, F) Used to detect inconsistencies C D E F between replicas hash(001) hash(002) hash(003) hash(004) (anti-entropy) and to minimise the Data Data Data Data Block Block Block Block amount of 001 002 003 004 transferred data http://en.wikipedia.org/wiki/Hash_tree 38
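The anti-entropy use of hash trees above can be sketched as follows (block contents are made up): two replicas compare root hashes, and only if they differ do they walk down to find, and transfer, the divergent blocks.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(blocks):
    # leaves: hashes of data blocks; internal nodes: hashes of their children
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])          # pad odd levels
        level = [h((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
    return level[0]

replica1 = [b"block001", b"block002", b"block003", b"block004"]
replica2 = [b"block001", b"block002", b"CHANGED!", b"block004"]
# roots disagree: the replicas are out of sync, with a single hash comparison
assert merkle_root(replica1) != merkle_root(replica2)
# comparing leaf hashes pinpoints the one block that must be transferred
diverged = [i for i, (a, b) in enumerate(zip(replica1, replica2)) if h(a) != h(b)]
assert diverged == [2]
```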
  • 92. Read Repair A F B GET(k, R=2) E D C 39
  • 93. Read Repair A F B GET(k, R=2) E D C 39
  • 94. Read Repair k=XYZ (v.2) A k=XYZ (v.2) F B GET(k, R=2) E D C k=ABC (v.1) 39
  • 95. Read Repair A F B k=XYZ (v.2) E UPDATE(k = XYZ) D C 39
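The read-repair sequence above can be sketched with replicas as plain dicts holding (version, value) pairs (an illustrative toy; real systems compare vector clocks, not integers): read from R replicas, return the newest version, and write it back to any stale replica.

```python
# Read repair sketch: newest version wins, stale replicas get updated.

def get_with_read_repair(replicas, key, r=2):
    sampled = replicas[:r]                          # ask R replicas
    version, value = max(s[key] for s in sampled)   # newest (version, value)
    for s in replicas:
        if s[key][0] < version:
            s[key] = (version, value)               # repair stale copies
    return value

a = {"k": (2, "XYZ")}
b = {"k": (2, "XYZ")}
c = {"k": (1, "ABC")}                               # stale replica
assert get_with_read_repair([a, b, c], "k") == "XYZ"
assert c["k"] == (2, "XYZ")                         # repaired after the read
```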
  • 96. NoSQL Break-down Key-value stores, Column Families, Document-oriented dbs, Graph databases 40
  • 97. Focus Of Different Data Models Key-Value Stores Size Column Families Document Databases Graph Databases Complexity http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases 41
  • 98. 1) Key-value stores Amazon Dynamo Paper Data model: collection of key-value pairs 42
  • 99. Voldemort AP LICENSE Dynamo DHT implementation Consistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
  • 100. Voldemort AP LICENSE Dynamo DHT implementation Consistent hashing,Vector clocks Apache 2 LANGUAGE Java HTTP / Sockets API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
  • 101. Voldemort AP LICENSE Dynamo DHT implementation Consistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL Conflicts resolved at read HTTP Java and write time Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
  • 102. Voldemort AP LICENSE Dynamo DHT implementation Consistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Json, Java String, byte[], Avro Protobuf Thrift, Avro, ProtoBuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
  • 103. Voldemort AP LICENSE Dynamo DHT implementation Consistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC Simple optimistic locking for multi-row updates, pluggable storage engine 43
  • 104. Voldemort AP LICENSE Dynamo DHT implementation Consistent hashing,Vector clocks Apache 2 LANGUAGE Java API/PROTOCOL HTTP Java Thrift Avro Protobuf PERSISTENCE Pluggable BDB/MySQL CONCURRENCY MVCC 43
  • 105. Membase CP LICENSE DHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
  • 106. Membase CP LICENSE DHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space hash function + table lookup http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
  • 107. Membase CP LICENSE DHT (K-V), no SPoF Apache 2 “VBuckets” LANGUAGE C/C++ membase memcached Erlang API/PROTOCOL persistence distributed replication in-memory REST/JSON (fail-over HA) memcached rebalancing Unit of consistency and replication Owner of a subset of the cluster key space hash function + table lookup All metadata kept in memory (high throughput / low latency). Manual/Programmatic failover via the Management REST API. http://dustin.github.com/2010/06/29/memcached-vbuckets.html 44
  • 108. Riak AP LICENSE Apache 2 LANGUAGE C, Erlang API/PROTOCOL REST HTTP * ProtoBuf Buckets → K-V “Links” (~relations) Targeted JS Map/Reduce Tune-able consistency (one-quorum-all) 45
  • 109. Redis CP LICENSE BSD. K-V store, “Data Structures Server”: Map, Set, Sorted Set, Linked List; Set/Queue operations, Counters, Pub-Sub, Volatile keys. LANGUAGE ANSI C. API * + PROTOCOL Telnet-like. PERSISTENCE in memory + bg snapshots; 10-100K op/s (whole dataset in RAM + VM); persistence via snapshotting (tunable fsync freq.). REPLICATION master-slave. Distributed if client supports consistent hashing http://redis.io/presentation/Redis_Cluster.pdf 46
  • 110. 2) Column Families Google BigTable paper Data model: big table, column families 47
  • 111. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” “com.cnn.www” <html>... “CNN” “CNN.com” column column column row_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 112. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” “com.cnn.www” <html>... “CNN” “CNN.com” column column column row_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 113. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” “com.cnn.www” <html>... “CNN” “CNN.com” column column column row_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 114. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 115. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 116. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 117. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family Atomic updates ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
  • 118. Google BigTable Paper Sparse, distributed, persistent multi-dimensional sorted map indexed by (row_key, column_key, timestamp) CF:col_name “contents:html” “anchor:cnnsi.com” “anchor:my.look.ca” <html>... t3 <html>... t5 “com.cnn.www” <html>... t6 “CNN” t9 “CNN.com” t8 column column column row_key row column family Atomic updates Automatic GC ACL http://labs.google.com/papers/bigtable-osdi06.pdf 48
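The data model above (a sparse, sorted map indexed by (row_key, column_key, timestamp), with multiple timestamped versions per cell) can be sketched as nested maps, reusing the paper's "com.cnn.www" example; the values here are shortened placeholders:

```python
# Sketch of the BigTable data model: {row_key: {column_key: {timestamp: value}}}

table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get(row, column):
    versions = table[row][column]
    return versions[max(versions)]      # newest version by timestamp

put("com.cnn.www", "contents:html", 3, "<html>v3")
put("com.cnn.www", "contents:html", 6, "<html>v6")
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
assert get("com.cnn.www", "contents:html") == "<html>v6"   # t6 shadows t3
assert get("com.cnn.www", "anchor:cnnsi.com") == "CNN"
```

The real system adds sorting by row key (enabling range scans), column families as the unit of access control, and automatic garbage collection of old versions.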
  • 119. Google BigTable: Data Structure. SSTable: smallest building block; a persistent immutable Map[k,v]; operations: lookup by key / key range scan. [SSTable: 64KB blocks + lookup index] 49
  • 120. Google BigTable: Data Structure. SSTable, plus Tablet: a dynamically partitioned range of rows, built from multiple SSTables; the unit of distribution and load balancing. [Tablet (range Aaa → Bar): multiple SSTables] 49
  • 121. Google BigTable: Data Structure. SSTable and Tablet, plus Table: multiple tablets (table segments) make up a table 49
  • 122. Google BigTable: I/O memtable read memory GFS tablet log SSTable SSTable SSTable write 50
  • 123. Google BigTable: I/O memtable read minor compaction memory GFS tablet log SSTable SSTable SSTable write 50
  • 124. Google BigTable: I/O memtable read minor compaction memory GFS tablet log SSTable SSTable SSTable write BMDiff Zippy 50
  • 125. Google BigTable: I/O memtable read minor compaction memory GFS tablet log SSTable SSTable SSTable write BMDiff Zippy merging / major compaction (GC) 50
  • 126. Google BigTable: Location Dereferencing. Chubby file → Root Tablet → Metadata Tablets → User Tables: up to 3 levels in the metadata hierarchy. Chubby: replicated, persisted root of the lock service; maintains tablet server locations; 5 replicas, one elected master (via quorum); Paxos algorithm used to keep consistency 51
  • 127. Google BigTable: Architecture fs metadata, ACL, GC, load balancing BigTable metadata operations BigTable client master data R/W heartbeat operations messages, GC, chunk migration Tablet Tablet Tablet Chubby Server Server Server track master lock, log of live servers Tablet Tablet Tablet 52
• 128. HBase — CP. OSS implementation of BigTable. LICENSE: Apache 2. LANGUAGE: Java. API/PROTOCOL: REST/HTTP, Thrift. PERSISTENCE: memtable/SSTable. 53
• 129. HBase — ZooKeeper as coordinator (instead of Chubby). 53
• 130. HBase — support for multiple masters. 53
• 131. HBase — storage on HDFS, S3, S3N, EBS (with GZip/LZO column-family compression). 53
• 132. HBase — data sorted by key but evenly distributed across the cluster. 53
• 133. HBase — batch streaming, Map/Reduce. 53
• 135. Hypertable — CP. OSS BigTable implementation, faster than HBase (10-30K/s). LICENSE: GPLv2. LANGUAGE: C++. API/PROTOCOL: C++, Thrift. PERSISTENCE: memtable/SSTable. CONCURRENCY: MVCC. HQL (~SQL) query language. 54
• 136. Hypertable — Hyperspace (Paxos) used instead of ZooKeeper. 54
• 137. Hypertable — dynamically adapts to changes in workload. 54
• 139. Cassandra — AP. Data model of BigTable, infrastructure of Dynamo. LICENSE: Apache 2. LANGUAGE: Java. PROTOCOL: Thrift, Avro. PERSISTENCE: memtable/SSTable. CONSISTENCY: tunable R/W/N. Column: { col_name, col_value, timestamp }. http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/ 55
• 140. Cassandra — Super Column: { super_column_name: [ { col_name, col_value, timestamp }, ... ] }. 55
• 141. Cassandra — Column Family: row_key → [ { col_name, col_value, timestamp }, ... ]. 55
• 142. Cassandra — Super Column Family: row_key → super columns → columns. Access: keyspace.get("column_family", key, ["super_column",] "column"). 55
