Cassandra is a distributed key-value database inspired by Amazon's Dynamo and Google's Bigtable. It uses a gossip-based protocol for node communication and consistent hashing to partition and replicate data across nodes. Cassandra stores data in memory (memtables) and on disk (SSTables), uses commit logs for crash recovery, and is highly available with tunable consistency.
6. Cassandra – From Dynamo and Bigtable Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. Cassandra was open sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (a Facebook engineer). In many ways you can think of Cassandra as Dynamo 2.0, or a marriage of Dynamo and BigTable.
7. Cassandra - Overview Cassandra is a distributed storage system for managing very large amounts of structured data spread across many commodity servers, while providing highly available service with no single point of failure. Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format.
8. Cassandra - Highlights ● High availability ● Incremental scalability ● Eventually consistent ● Tunable tradeoffs between consistency and latency ● Minimal administration ● No SPOF (Single Point of Failure)
9. Cassandra – Trade-Offs ● No Transactions ● No Ad-hoc Queries ● No Joins ● No Flexible Indexes Data Modeling with Cassandra Column Families http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
10. Cassandra From Dynamo and BigTable Introduction to Cassandra: Replication and Consistency http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency
11. Dynamo-like Features ● Symmetric, P2P Architecture (No Special Nodes/SPOFs) ● Gossip-based Cluster Management ● Distributed Hash Table for Data Placement: Pluggable Partitioning, Pluggable Topology Discovery, Pluggable Placement Strategies ● Tunable, Eventual Consistency Data Modeling with Cassandra Column Families http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
12. BigTable-like Features ● Sparse "Columnar" Data Model: Optional, 2-level Maps Called Super Column Families ● SSTable Disk Storage: Append-only Commit Log, Memtable (buffer and sort), Immutable SSTable Files ● Hadoop Integration Data Modeling with Cassandra Column Families http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
13. Brewer's CAP Theorem CAP (Consistency, Availability and Partition Tolerance). Pick two of Consistency, Availability, Partition tolerance. Theorem: you can have at most two of these properties for any shared-data system. http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
14. ACID & BASE ACID (Atomicity, Consistency, Isolation, Durability). BASE (Basically Available, Soft-state, Eventually consistent). ACID: http://en.wikipedia.org/wiki/ACID ACID and BASE: MySQL and NoSQL: http://www.schoonerinfotech.com/solutions/general/what_is_nosql ACID: strong consistency; isolation; focus on "commit"; nested transactions; availability?; conservative (pessimistic); difficult evolution (e.g. schema). BASE: weak consistency (stale data OK); availability first; best effort; approximate answers OK; aggressive (optimistic); simpler, faster, easier evolution.
15. NoSQL The term "NoSQL" was used in 1998 as the name for a lightweight, open source relational database that did not expose a SQL interface. Its author, Carlo Strozzi, claims that since the NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." CAP, BASE, Eventual Consistency. NoSQL: http://en.wikipedia.org/wiki/NoSQL http://nosql-database.org/
16. Dynamo & Bigtable Dynamo partitioning and replication; log-structured ColumnFamily data model similar to Bigtable's. ● Bigtable: A Distributed Storage System for Structured Data, 2006 ● Dynamo: Amazon's Highly Available Key-value Store, 2007
17. Dynamo & Bigtable ● BigTable: strong consistency; sparse map data model; GFS, Chubby, etc. ● Dynamo: O(1) distributed hash table (DHT); BASE (eventual consistency); client-tunable consistency/availability
18. Dynamo & Bigtable ● CP: Bigtable, Hypertable, HBase ● AP: Dynamo, Voldemort, Cassandra
20. Dynamo Architecture & Lookup ● O(1) node lookup ● Explicit replication ● Eventually consistent
21. Dynamo Dynamo: a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience (not to be confused with HP's Dynamo, a software dynamic optimization system that transparently improves the performance of a native instruction stream as it executes on the processor).
30. Consistent Hashing - Dynamo Dynamo splits each server into v virtual nodes and places all n*v virtual nodes at random positions on the consistent-hashing ring. A key is served by the first vnode found walking clockwise from the key's position on the ring; if that node has failed, the next vnode clockwise is used as the replacement. When a single node fails, its load is spread evenly across all the remaining nodes, and the scheme is also elegant to implement.
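A minimal Python sketch of the vnode scheme just described, assuming a made-up cluster of three nodes with eight virtual nodes each (node names, vnode count, and the MD5 hash choice are illustrative, not Dynamo's or Cassandra's actual implementation): each key is routed to the first vnode found walking clockwise around the ring.

```python
# Sketch of Dynamo-style consistent hashing with virtual nodes (vnodes).
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        self._ring = []          # sorted list of (hash_position, physical_node)
        for node in nodes:
            for v in range(vnodes_per_node):
                self._ring.append((_hash(f"{node}#vnode{v}"), node))
        self._ring.sort()

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first vnode."""
        position = _hash(key)
        index = bisect.bisect_right(self._ring, (position, ""))
        return self._ring[index % len(self._ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # key routed to its clockwise successor vnode
```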
34. Bigtable Tablet In Bigtable a table is split into slices; each slice is called a tablet and is kept at roughly 100-200 MB per tablet. Column Families: the basic unit of access control; all data stored in a column family is usually of the same type (data in the same column family is compressed together). Timestamp: each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable treats data as uninterpreted strings.
35. Bigtable: Data Model <Row, Column, Timestamp> triple for key: lookup, insert, and delete API. Arbitrary "columns" on a row-by-row basis. Column family:qualifier; family is heavyweight, qualifier lightweight. Column-oriented physical store; rows are sparse! Does not support a relational model: no table-wide integrity constraints, no multi-row transactions.
36. Bigtable: Tablet Location Hierarchy A three-level hierarchy analogous to that of a B+ tree is used to store tablet location information.
37. Bigtable: METADATA The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row. Each METADATA row stores approximately 1KB of data in memory. The METADATA table also stores secondary information, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.
41. Cassandra – Data Model A table in Cassandra is a distributed multi-dimensional map indexed by a key. The value is an object which is highly structured. Every operation under a single row key is atomic per replica, no matter how many columns are being read or written. Columns are grouped together into sets called column families (very similar to the Bigtable system). Cassandra exposes two kinds of column families, Simple and Super column families. Super column families can be visualized as a column family within a column family.
42. Cassandra – Data Model [Slide diagram: one row identified by KEY holding three column families. ColumnFamily1 "MailList" (Type: Simple, Sort: Name) holds columns tid1-tid4, each with a binary value and a timestamp. ColumnFamily2 "WordList" (Type: Super, Sort: Time) holds super columns "aloha" and "dude", each containing (column, value, timestamp) triples. ColumnFamily3 "System" (Type: Super, Sort: Name) holds super columns hint1-hint4, each with a column list.] Column Families are declared upfront; SuperColumns are added and modified dynamically; Columns are added and modified dynamically.
43. Cassandra – Data Model Keyspace: uppermost namespace; typically one per application; roughly equivalent to a database. ColumnFamily: associates records of a similar kind (not the same kind, because CFs are sparse tables); record-level atomicity; indexed. Row: each row is uniquely identifiable by key; rows group columns and super columns. Column: basic unit of storage.
49. Cassandra – Data Model - Cluster Cluster > Keyspace. Partitioners: OrderPreservingPartitioner, RandomPartitioner. Like an RDBMS schema: one keyspace per application.
50. Cassandra – Data Model Cluster > Keyspace > Column Family Like an RDBMS table: Separates types in an app
51. Cassandra – Data Model Cluster > Keyspace > Column Family > Row. Each row is a SortedMap<Name,Value> of columns.
52. Cassandra – Data Model Cluster > Keyspace > Column Family > Row > "Column". Name -> Value: byte[] -> byte[], plus a version timestamp. Not like an RDBMS column: it is an attribute of the row, and each row can contain millions of different columns.
53. Cassandra – Data Model Any column within a column family is accessed using the convention: column family:column. Any column within a column family of type Super is accessed using the convention: column family:super column:column.
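As a rough illustration of the hierarchy on slides 49-53, the sketch below models the data as nested maps: keyspace > column family > row > column, with each column holding a value and a timestamp. The keyspace, column family, and row names are invented for the example; real Cassandra stores raw bytes and sorts columns according to the declared comparator.

```python
import time

# keyspace -> column family -> row key -> column name -> (value, timestamp)
cluster = {
    "UserApp": {                                   # keyspace (~= database)
        "Users": {                                 # column family (~= table)
            "user:42": {                           # row, addressed by key
                "email": (b"bob@example.com", time.time()),
                "name":  (b"Bob", time.time()),
            }
        }
    }
}

def get_column(keyspace, cf, row_key, column):
    """Access convention from the slide: column family : column."""
    value, ts = cluster[keyspace][cf][row_key][column]
    return value

print(get_column("UserApp", "Users", "user:42", "email"))
```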
55. Storage Model [Slide diagram: the write path.] A write for a key touching column families CF1, CF2, CF3 is binary-serialized and appended to the commit log on a dedicated disk, then applied to one memtable per column family (Memtable(CF1), Memtable(CF2), Memtable(CF3)). A FLUSH is triggered by data size, number of objects, or lifetime. On flush, the memtable is written to a data file on disk as a sequence of <key name><size of key data><index of columns/supercolumns><serialized column family> entries, along with a block index of <key name>, offset pairs (sampled entries such as K128, K256, K384 with their offsets) and a Bloom filter kept in memory.
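A toy sketch of this write path, assuming a JSON on-disk format and a tiny flush threshold purely for illustration (Cassandra's real commit log and SSTable formats are binary, as described above): append to the commit log, update the in-memory memtable, and flush it to a sorted, immutable file once the threshold is crossed.

```python
import json
import time

MEMTABLE_FLUSH_THRESHOLD = 4           # flush after this many rows (toy value)

commit_log = open("commitlog.txt", "a")
memtable = {}                          # key -> column data for this column family
sstable_count = 0

def write(key, columns):
    global sstable_count
    commit_log.write(json.dumps({"key": key, "columns": columns}) + "\n")
    commit_log.flush()                 # durability before acknowledging the write
    memtable[key] = columns
    if len(memtable) >= MEMTABLE_FLUSH_THRESHOLD:
        path = f"sstable-{sstable_count}.json"
        with open(path, "w") as f:     # keys written in sorted order, never updated
            for k in sorted(memtable):
                f.write(json.dumps({"key": k, "columns": memtable[k]}) + "\n")
        memtable.clear()
        sstable_count += 1

write("user:42", {"name": "Bob", "ts": time.time()})
```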
56. Storage Model - Compactions [Slide diagram: merge-sort compaction.] Several sorted data files (e.g. one with K1, K2, K3, ..., one with K2, K10, K30, ..., one with K4, K5, K10, ...) are combined with a MERGE SORT into a single sorted data file (K1, K2, K3, K4, K5, K10, K30, ...); entries marked DELETED are dropped. A new index file (K1 offset, K5 offset, K30 offset, ...) and a new Bloom filter are built and loaded in memory.
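The merge step can be sketched as follows; the (key, timestamp, value, deleted) tuples and the in-memory merge are simplifications of the real on-disk format, but the idea is the same: merge the sorted inputs, keep only the newest version of each key, and drop rows whose newest version is a tombstone.

```python
import heapq

def compact(sstables):
    """Each sstable is a list of (key, timestamp, value, deleted) sorted by key."""
    merged = {}
    for key, ts, value, deleted in heapq.merge(*sstables):
        current = merged.get(key)
        if current is None or ts > current[0]:
            merged[key] = (ts, value, deleted)
    # emit sorted output, skipping rows whose newest version is a tombstone
    return [(k, v) for k, (ts, v, deleted) in sorted(merged.items()) if not deleted]

old = [("K1", 1, "a", False), ("K2", 1, "b", False)]
new = [("K2", 2, "b2", False), ("K30", 2, None, True)]   # K30 was deleted
print(compact([old, new]))    # [('K1', 'a'), ('K2', 'b2')]
```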
64. System Architecture Content Overview: Partitioning, Replication, Membership & Failure Detection, Bootstrapping, Scaling the Cluster, Local Persistence, Communication
65. System Architecture Core Layer: Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication. Middle Layer: Commit log, Memtable, SSTable, Indexes, Compaction. Top Layer: Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools.
73. System Architecture The architecture of a storage system needs to have the following characteristics: scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management.
74. System Architecture We will focus on the core distributed systems techniques used in Cassandra: partitioning, replication, membership, failure handling, and scaling. All these modules work in synchrony to handle read/write requests.
75. System Architecture - Partitioning One of the key design features of Cassandra is the ability to scale incrementally. This requires the ability to dynamically partition the data over the set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing, but uses an order-preserving hash function to do so. The idea is that all the nodes, by their hash values, are located on a ring; the position of a node on the ring is randomly determined, and each node is responsible for replicating a range of the hash function's output space.
76. System Architecture – Partitioning (Ring Topology) [Slide diagram: nodes a, d, g, j on a ring, RF=3.] Conceptual Ring; one token per node; multiple ranges per node.
77. System Architecture – Partitioning (Ring Topology) [Slide diagram: nodes a, d, g, j on a ring, RF=2.] Conceptual Ring; one token per node; multiple ranges per node.
78. System Architecture – Partitioning (New Node) [Slide diagram: node m joining a ring of nodes a, d, g, j, RF=3.] Token assignment; range adjustment; bootstrap; an arrival only affects immediate neighbors.
79. System Architecture – Partitioning (Ring Partition) [Slide diagram: nodes a, d, g, j on a ring, RF=3, with one node down.] Node dies. Still available? Hinted handoff. Achtung! Plan for this.
83. System Architecture – Partitioning RandomPartitioner: uses MD5(key) to distribute data across nodes; even distribution of keys from one CF across ranges/nodes. OrderPreservingPartitioner: key distribution determined by token; lexicographical ordering; you can specify the token for this node to use; 'scrabble' distribution; required for range queries (scan over rows like a cursor in an index).
84. System Architecture – Partitioning - Token A Token is a partitioner-dependent element on the Ring. Each node has a single, unique Token. Each node claims a Range of the Ring from its Token to the Token of the previous node on the Ring.
85. System Architecture – Partitioning Map from Key Space to Token. RandomPartitioner: Tokens are integers in the range [0 .. 2^127]; MD5(Key) -> Token. Good: even key distribution. Bad: inefficient range queries. OrderPreservingPartitioner: Tokens are UTF8 strings in the range ["" .. ); Key -> Token. Good: efficient range queries. Bad: uneven key distribution.
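A small sketch contrasting the two partitioners; the token computations follow the slide (MD5 into the [0, 2^127] token space for RandomPartitioner, the raw key for OrderPreservingPartitioner), but the function names are mine, not Cassandra's API.

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    # MD5 gives an even spread of tokens, so load balances well,
    # but adjacent keys land on unrelated nodes (range scans are costly).
    return int(hashlib.md5(key).hexdigest(), 16) % (2 ** 127)

def order_preserving_token(key: bytes) -> bytes:
    # The key itself is the token: lexicographic order is preserved,
    # so range scans are cheap but hot key ranges create hotspots.
    return key

print(random_partitioner_token(b"alice"))
print(order_preserving_token(b"alice"))
```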
86. System Architecture – Snitching Map from Nodes to Physical Location. EndpointSnitch: guesses rack and datacenter based on the IP address octets. DataCenterEndpointSnitch: specify IP subnets for racks, grouped per datacenter. PropertySnitch: specify arbitrary mappings from individual IP addresses to racks and datacenters.
87. System Architecture - Replication Cassandra uses replication to achieve high availability and durability. Each data item is replicated at N hosts, where N is the replication factor configured "per-instance". Each key, k, is assigned to a coordinator node. The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 other nodes in the ring.
88. System Architecture – Placement Map from Token Space to Nodes. The first replica is always placed on the node that claims the range in which the token falls. Strategies determine where the rest of the replicas are placed. Cassandra provides the client with various options for how data needs to be replicated, offering replication policies such as: Rack Unaware, Rack Aware (within a datacenter), Datacenter Aware.
89. System Architecture - Replication Rack Unaware Place replicas on the N-1 subsequent nodes around the ring, ignoring topology. If an application chooses the "Rack Unaware" replication strategy, the non-coordinator replicas are chosen by picking the N-1 successors of the coordinator on the ring.
90. System Architecture - Replication Rack Aware (within a datacenter) Place the second replica in another datacenter, and the remaining N-2 replicas on nodes in other racks in the same datacenter.
91. System Architecture - Replication Datacenter Aware Place M of the N replicas in another datacenter, and the remaining N-M-1 replicas on nodes in other racks in the same datacenter.
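For the simplest of these strategies, Rack Unaware placement can be sketched as taking the coordinator plus its N-1 successors on the ring; the ring layout and node names below are invented for illustration.

```python
# Sketch of "Rack Unaware" placement: the coordinator owns the key's range and
# the remaining N-1 replicas are its successors on the ring, ignoring topology.
def rack_unaware_replicas(ring_nodes, coordinator_index, replication_factor):
    """ring_nodes is the ordered list of nodes around the ring."""
    return [ring_nodes[(coordinator_index + i) % len(ring_nodes)]
            for i in range(replication_factor)]

ring = ["a", "d", "g", "j"]
print(rack_unaware_replicas(ring, coordinator_index=1, replication_factor=3))
# ['d', 'g', 'j'] -> coordinator d plus its two successors
```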
93. System Architecture - Replication 1) Every node is aware of every other node in the system and hence the range each is responsible for; this knowledge is spread through gossiping (not via the leader). 2) A key is assigned to a node; that node is the key's coordinator and is responsible for replicating the item associated with the key on N-1 replicas in addition to itself. 3) Cassandra offers several replication policies and leaves it up to the application to choose one. These policies differ in the location of the selected replicas; Rack Aware, Rack Unaware, and Datacenter Aware are some of them. 4) Whenever a new node joins the system it contacts the leader of the Cassandra cluster, who tells the node the range for which it is responsible for replicating the associated keys. 5) Cassandra uses Zookeeper for maintaining the leader. 6) The nodes responsible for the same range are called the "preference list" for that range; this terminology is borrowed from Dynamo.
95. System Architecture - Replication Replication factor: how many nodes data is replicated on. Consistency level: Zero, One, Quorum, All; sync or async for writes; reliability of reads; read repair.
96. System Architecture – Replication (Leader) Cassandra elects a leader amongst its nodes using a system called Zookeeper. All nodes, on joining the cluster, contact the leader, who tells them for which ranges they are replicas; the leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring. The metadata about the ranges a node is responsible for is cached locally at each node and, in a fault-tolerant manner, inside Zookeeper; this way a node that crashes and comes back up knows what ranges it was responsible for. We borrow from Dynamo parlance and deem the nodes that are responsible for a given range the "preference list" for the range.
97. System Architecture - Membership Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism.
98. System Architecture - Failure handling Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down. In Cassandra, failure detection is also used to avoid attempts to communicate with unreachable nodes during various operations. Cassandra uses a modified version of the Accrual Failure Detector.
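A very rough sketch of the accrual idea, assuming exponentially distributed heartbeat intervals for simplicity (the real detector estimates the arrival distribution more carefully, and the threshold value here is arbitrary, not Cassandra's): instead of a binary up/down answer, the detector emits a suspicion level phi that grows the longer a heartbeat is overdue relative to past inter-arrival times.

```python
import math
import time

class AccrualFailureDetector:
    def __init__(self):
        self.intervals = []         # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self):
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = time.time() - self.last_heartbeat
        # assume exponentially distributed arrivals: phi = -log10(e^(-t/mean))
        return elapsed / (mean * math.log(10))

detector = AccrualFailureDetector()
detector.heartbeat(); time.sleep(0.1); detector.heartbeat()
print(detector.phi() < 8)   # suspect the node once phi crosses a chosen threshold
```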
99. System Architecture - Bootstrapping When a node starts for the first time, it chooses a random token for its position in the ring. In Cassandra, joins and leaves of nodes are initiated through an explicit mechanism rather than an automatic one. A node may be ordered to leave the network because of an observed malfunction, with the expectation that it will be back soon; if a node leaves the network forever, or a new node joins, data re-partitioning is required. A new node is frequently added because some existing nodes can no longer handle their load, so the new node is assigned part of the range that a heavily loaded node is currently responsible for. In that case data must be transferred between the two replicas, the old and the new one. This is usually done after the administrator issues a new join, and it does not shut the system down for the fraction of the range being transferred, since other replicas hold the same data. Once the data has been transferred to the new node, the older node no longer stores it.
100. System Architecture - Scaling When a new node is added into the system, it gets assigned a token such that it can alleviate a heavily loaded node
102. System Architecture - Local Persistence The Cassandra system relies on the local file system for data persistence.
103. System Architecture - Communication Control messages use UDP; application-related messages like read/write requests and replication requests are based on TCP.
105. Cassandra – Read/Write Tunable Consistency - per read/write. One: return once one replica responds success. Quorum: return once RF/2 + 1 replicas respond. All: return when all replicas respond. Want async replication? Write = ONE, Read = ONE (Performance++). Want strong consistency? Read = QUORUM, Write = QUORUM. Want strong consistency per datacenter? Read = LOCAL_QUORUM, Write = LOCAL_QUORUM.
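The arithmetic behind these levels can be sketched as follows; the R + W > RF check shows why QUORUM reads combined with QUORUM writes give strong consistency (this is a generic illustration, not Cassandra client code).

```python
# How many replicas must acknowledge at each consistency level, and the
# R + W > RF rule: read and write quorums are guaranteed to overlap.
def replicas_required(level: str, replication_factor: int) -> int:
    return {"ONE": 1,
            "QUORUM": replication_factor // 2 + 1,
            "ALL": replication_factor}[level]

RF = 3
r = replicas_required("QUORUM", RF)
w = replicas_required("QUORUM", RF)
print(r, w, r + w > RF)   # 2 2 True -> reads are guaranteed to see the latest write
```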
106. Cassandra – Read/Write When a read or write request reaches any node in the cluster, the state machine morphs through the following states: the nodes that replicate the data for the key are identified; the request is forwarded to all those nodes and the coordinator waits for the responses to arrive; if the replies do not arrive within a configured timeout value, the request fails and an error is returned to the client; if replies are received, the latest response is determined based on timestamp; replicas with old data are updated (a repair of the data is scheduled at any replica that does not have the latest piece of data).
107. Cassandra - Read Repair On every read, all replicas are queried. Only one replica's data is returned to the client. Checksums or timestamps from all replicas are compared. If any inconsistency is found, the full data is fetched from all replicas and merged, and the latest data is written back to the out-of-sync nodes.
108. Cassandra - Reads Practically lock free. SSTable proliferation. New in 0.6: row cache (avoids the SSTable lookup; not write-through) and key cache (avoids the index scan).
110. Read Query [Slide diagram: read path through a Cassandra cluster.] The client sends a read query to the cluster, which forwards it to the closest replica (Replica A) as a full read and to the other replicas (B and C) as digest queries. The result from the closest replica is returned to the client; if the digest responses differ, a read repair is performed.
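A sketch of the digest-compare-and-repair flow, with the replica state held in an in-memory dict purely for illustration: the closest replica returns data, the others are compared by digest, and any stale replica is overwritten with the newest-timestamped version.

```python
import hashlib

replicas = {
    "A": {"value": b"new", "ts": 2},
    "B": {"value": b"old", "ts": 1},   # stale replica
    "C": {"value": b"new", "ts": 2},
}

def digest(record):
    return hashlib.md5(record["value"]).hexdigest()

def read(closest="A"):
    data = replicas[closest]
    mismatched = [n for n in replicas
                  if n != closest and digest(replicas[n]) != digest(data)]
    if mismatched:
        # resolve to the newest version and repair the out-of-sync replicas
        newest = max(replicas.values(), key=lambda r: r["ts"])
        for n in mismatched:
            replicas[n] = dict(newest)
        data = newest
    return data["value"]

print(read())            # b'new'; replica B has been repaired
```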
111. Cassandra - Write No reads No seeks Sequential disk access Atomic within a column family Fast Any node Always writeable
112. Cassandra – Write (Properties) No locks in the critical path; sequential disk access; behaves like a write-back cache; append support without read-ahead; atomicity guarantee for a key; "always writable" (accepts writes during failure scenarios).
113. Cassandra - Writes Commit log for durability (configurable fsync); sequential writes only. Memtable: no disk access (no reads or seeks). SSTables are final (become read only): indexes, Bloom filter, raw data. Bottom line: FAST.
115. Cassandra - Write The system can be configured to perform either synchronous or asynchronous writes. For certain systems that require high throughput we rely on asynchronous replication; here the writes far exceed the reads that come into the system. In the synchronous case we wait for a quorum of responses before we return a result to the client.
117. Cassandra – Write (Fast) Fast writes: SEDA (staged event-driven architecture), a general-purpose framework for high concurrency and load conditioning. It decomposes applications into stages separated by queues and adopts a structured approach to event-driven concurrency.
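A minimal sketch of the SEDA idea, assuming two invented stages connected by queues and drained by their own threads, so a slow stage backs up its own queue instead of blocking the rest of the pipeline (this is a toy illustration, not Cassandra's staging).

```python
import queue
import threading
import time

parse_queue, write_queue = queue.Queue(), queue.Queue()

def parse_stage():
    # stage 1: pull raw requests off its queue, transform, hand to the next stage
    while True:
        request = parse_queue.get()
        write_queue.put(request.upper())

def write_stage():
    # stage 2: drains its own queue independently of stage 1's load
    while True:
        print("committed:", write_queue.get())

for stage in (parse_stage, write_stage):
    threading.Thread(target=stage, daemon=True).start()

parse_queue.put("insert row 1")
time.sleep(0.2)        # give the daemon stages time to drain before exiting
```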
127. Other - DHT DHTs (Distributed Hash Tables): a DHT is a class of decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems. http://en.wikipedia.org/wiki/Distributed_hash_table
132. Other - Bloom filter An example of a Bloom filter, representing the set { x , y , z }. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3. http://en.wikipedia.org/wiki/Bloom_filter
134. Other - Bloom filter A Bloom filter can be used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times, while Bloom filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without it; its use, however, does increase memory usage.
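A toy Bloom filter using the figure's parameters (m=18 bits, k=3 hash functions); deriving the k positions from salted MD5 hashes is just one convenient choice for the sketch, not how Cassandra derives its hashes. Membership tests may return false positives but never false negatives, so an SSTable is only read from disk when the filter says the key might be present.

```python
import hashlib

M, K = 18, 3                 # bit-array size and number of hash functions
bits = [0] * M

def _positions(key: str):
    for i in range(K):
        yield int(hashlib.md5(f"{i}:{key}".encode()).hexdigest(), 16) % M

def add(key: str):
    for p in _positions(key):
        bits[p] = 1

def might_contain(key: str) -> bool:
    return all(bits[p] for p in _positions(key))

for element in ("x", "y", "z"):
    add(element)
print(might_contain("x"), might_contain("w"))   # True, (almost always) False
```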
135. Other - Timestamps and Vector Clocks Eventual consistency relies on deciding what value a row will eventually converge to. In the case of two writers writing at "the same" time, this is difficult. Timestamps are one solution, but they rely on synchronized clocks and don't capture causality. Vector clocks are an alternative method of capturing order in a distributed system.
136. Other - Vector Clocks Definition: a vector clock is a tuple {T1, T2, ..., TN} of clock values, one from each node. V1 < V2 if: for all i, V1[i] <= V2[i], and for at least one i, V1[i] < V2[i]. V1 < V2 implies a global time ordering of events. When data is written from node i, it sets T[i] to its clock value. This allows eventual consistency to resolve the ordering of writes on multiple replicas.
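The comparison rule translates directly into code; this is a generic sketch of the definition above, not any particular database's implementation.

```python
# V1 < V2 iff every component of V1 is <= the corresponding component of V2
# and at least one component is strictly smaller; otherwise the two writes
# are concurrent and need an application-level merge.
def happens_before(v1, v2):
    return (all(a <= b for a, b in zip(v1, v2))
            and any(a < b for a, b in zip(v1, v2)))

v1, v2, v3 = (1, 0, 2), (2, 0, 2), (0, 1, 2)
print(happens_before(v1, v2))                          # True: v1 precedes v2
print(happens_before(v1, v3), happens_before(v3, v1))  # False, False: concurrent
```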
Every node in Dynamo is a member of the system. To make data forwarding between nodes faster (reducing transfer latency and improving response times), Amazon requires every member node to keep routing information about all other nodes. Because of machine failures or human actions, members join and leave the system all the time. To keep every node's membership view up to date, each node picks, at a fixed interval (one second), a random peer and communicates with it using a gossip-style mechanism [1]; if the connection succeeds, the two nodes exchange the membership information they hold, including data-placement and routing information.
The cluster is a logical storage ring. Node placement divides the ring into ranges that represent start/stop points for keys. Token assignment can be automatic or manual. Nodes placed closer together carry less responsibility and less data.