Cassandra is a distributed key-value database inspired by Amazon's Dynamo and Google's Bigtable. It uses a gossip-based protocol for node communication and consistent hashing to partition and replicate data across nodes. Cassandra stores data in memory (memtables) and on disk (SSTables), uses commit logs for crash recovery, and is highly available with tunable consistency.
6. Cassandra – From Dynamo and Bigtable Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. Cassandra was open sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (a Facebook engineer). In many ways you can think of Cassandra as Dynamo 2.0, or a marriage of Dynamo and BigTable.
7. Cassandra - Overview Cassandra is a distributed storage system for managing very large amounts of structured data spread across many commodity servers, while providing highly available service with no single point of failure. Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format.
8. Cassandra - Highlights ● High availability ● Incremental scalability ● Eventually consistent ● Tunable tradeoffs between consistency and latency ● Minimal administration ● No SPOF (Single Point of Failure)
9. Cassandra – Trade-Offs ● No Transactions ● No Ad-hoc Queries ● No Joins ● No Flexible Indexes Data Modeling with Cassandra Column Families http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
10. Cassandra From Dynamo and BigTable Introduction to Cassandra: Replication and Consistency http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency
11. Dynamo-like Features ● Symmetric, P2P Architecture (No Special Nodes/SPOFs) ● Gossip-based Cluster Management ● Distributed Hash Table for Data Placement: Pluggable Partitioning, Pluggable Topology Discovery, Pluggable Placement Strategies ● Tunable, Eventual Consistency Data Modeling with Cassandra Column Families http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
12. BigTable-like Features ● Sparse "Columnar" Data Model: Optional, 2-level Maps Called Super Column Families ● SSTable Disk Storage: Append-only Commit Log, Memtable (buffer and sort), Immutable SSTable Files ● Hadoop Integration Data Modeling with Cassandra Column Families http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
13. Brewer's CAP Theorem CAP (Consistency, Availability and Partition Tolerance). Pick two of Consistency, Availability, Partition tolerance. Theorem: you can have at most two of these properties for any shared-data system. http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
14. ACID & BASE ACID (Atomicity, Consistency, Isolation, Durability). BASE (Basically Available, Soft-state, Eventually consistent). ACID: http://en.wikipedia.org/wiki/ACID ACID and BASE: MySQL and NoSQL: http://www.schoonerinfotech.com/solutions/general/what_is_nosql ACID: strong consistency; isolation; focus on "commit"; nested transactions; availability?; conservative (pessimistic); difficult evolution (e.g. schema). BASE: weak consistency (stale data OK); availability first; best effort; approximate answers OK; aggressive (optimistic); simpler, faster, easier evolution.
15. NoSQL The term "NoSQL" was used in 1998 as the name for a lightweight, open source relational database that did not expose a SQL interface. Its author, Carlo Strozzi, claims that since the NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." CAP, BASE, Eventual Consistency. NoSQL: http://en.wikipedia.org/wiki/NoSQL http://nosql-database.org/
16. Dynamo & Bigtable Dynamo partitioning and replication; log-structured ColumnFamily data model similar to Bigtable's. ● Bigtable: A Distributed Storage System for Structured Data, 2006 ● Dynamo: Amazon's Highly Available Key-value Store, 2007
17. Dynamo & Bigtable ● BigTable: strong consistency; sparse map data model; GFS, Chubby, etc. ● Dynamo: O(1) distributed hash table (DHT); BASE (eventual consistency); client-tunable consistency/availability
18. Dynamo & Bigtable ● CP: Bigtable, Hypertable, HBase ● AP: Dynamo, Voldemort, Cassandra
20. Dynamo Architecture & Lookup ● O(1) node lookup ● Explicit replication ● Eventually consistent
21. Dynamo Dynamo: a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience (not to be confused with HP's Dynamo, a software dynamic optimization system that transparently improves the performance of a native instruction stream as it executes on the processor).
30. Consistent Hashing - Dynamo Dynamo splits each server into v virtual nodes and places all n*v virtual nodes at random positions on the consistent-hashing ring. A key is served by the first vnode found walking clockwise from the key's position on the ring; if that node has failed, the next vnode clockwise is used as the replacement. When a single node fails, its load is spread evenly across all the remaining nodes, and the scheme is also elegant to implement.
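A minimal Python sketch of the vnode scheme just described, assuming a made-up cluster of three nodes with eight virtual nodes each (node names, vnode count, and the MD5 hash choice are illustrative, not Dynamo's or Cassandra's actual implementation): each key is routed to the first vnode found walking clockwise around the ring.

```python
# Sketch of Dynamo-style consistent hashing with virtual nodes (vnodes).
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        self._ring = []          # sorted list of (hash_position, physical_node)
        for node in nodes:
            for v in range(vnodes_per_node):
                self._ring.append((_hash(f"{node}#vnode{v}"), node))
        self._ring.sort()

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first vnode."""
        position = _hash(key)
        index = bisect.bisect_right(self._ring, (position, ""))
        return self._ring[index % len(self._ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # key routed to its clockwise successor vnode
```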
34. Bigtable Tablet In Bigtable a table is split into slices; each slice is called a tablet and is kept at roughly 100-200 MB per tablet. Column Families: the basic unit of access control; all data stored in a column family is usually of the same type (data in the same column family is compressed together). Timestamp: each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable treats data as uninterpreted strings.
35. Bigtable: Data Model <Row, Column, Timestamp> triple for key: lookup, insert, and delete API. Arbitrary "columns" on a row-by-row basis. Column family:qualifier; family is heavyweight, qualifier lightweight. Column-oriented physical store; rows are sparse! Does not support a relational model: no table-wide integrity constraints, no multi-row transactions.
36. Bigtable: Tablet Location Hierarchy A three-level hierarchy analogous to that of a B+ tree is used to store tablet location information.
37. Bigtable: METADATA The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row. Each METADATA row stores approximately 1KB of data in memory. The METADATA table also stores secondary information, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.
41. Cassandra – Data Model A table in Cassandra is a distributed multi-dimensional map indexed by a key. The value is an object which is highly structured. Every operation under a single row key is atomic per replica, no matter how many columns are being read or written. Columns are grouped together into sets called column families (very similar to the Bigtable system). Cassandra exposes two kinds of column families, Simple and Super column families. Super column families can be visualized as a column family within a column family.
42. Cassandra – Data Model [Slide diagram: one row identified by KEY holding three column families. ColumnFamily1 "MailList" (Type: Simple, Sort: Name) holds columns tid1-tid4, each with a binary value and a timestamp. ColumnFamily2 "WordList" (Type: Super, Sort: Time) holds super columns "aloha" and "dude", each containing (column, value, timestamp) triples. ColumnFamily3 "System" (Type: Super, Sort: Name) holds super columns hint1-hint4, each with a column list.] Column Families are declared upfront; SuperColumns are added and modified dynamically; Columns are added and modified dynamically.
43. Cassandra – Data Model Keyspace: uppermost namespace; typically one per application; roughly equivalent to a database. ColumnFamily: associates records of a similar kind (not the same kind, because CFs are sparse tables); record-level atomicity; indexed. Row: each row is uniquely identifiable by key; rows group columns and super columns. Column: basic unit of storage.
49. Cassandra – Data Model - Cluster Cluster > Keyspace. Partitioners: OrderPreservingPartitioner, RandomPartitioner. Like an RDBMS schema: one keyspace per application.
50. Cassandra – Data Model Cluster > Keyspace > Column Family Like an RDBMS table: Separates types in an app
51. Cassandra – Data Model Cluster > Keyspace > Column Family > Row. Each row is a SortedMap<Name,Value> of columns.
52. Cassandra – Data Model Cluster > Keyspace > Column Family > Row > "Column". Name -> Value: byte[] -> byte[], plus a version timestamp. Not like an RDBMS column: it is an attribute of the row, and each row can contain millions of different columns.
53. Cassandra – Data Model Any column within a column family is accessed using the convention: column family:column. Any column within a column family of type Super is accessed using the convention: column family:super column:column.
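As a rough illustration of the hierarchy on slides 49-53, the sketch below models the data as nested maps: keyspace > column family > row > column, with each column holding a value and a timestamp. The keyspace, column family, and row names are invented for the example; real Cassandra stores raw bytes and sorts columns according to the declared comparator.

```python
import time

# keyspace -> column family -> row key -> column name -> (value, timestamp)
cluster = {
    "UserApp": {                                   # keyspace (~= database)
        "Users": {                                 # column family (~= table)
            "user:42": {                           # row, addressed by key
                "email": (b"bob@example.com", time.time()),
                "name":  (b"Bob", time.time()),
            }
        }
    }
}

def get_column(keyspace, cf, row_key, column):
    """Access convention from the slide: column family : column."""
    value, ts = cluster[keyspace][cf][row_key][column]
    return value

print(get_column("UserApp", "Users", "user:42", "email"))
```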
55. Storage Model [Slide diagram: the write path.] A write for a key touching column families CF1, CF2, CF3 is binary-serialized and appended to the commit log on a dedicated disk, then applied to one memtable per column family (Memtable(CF1), Memtable(CF2), Memtable(CF3)). A FLUSH is triggered by data size, number of objects, or lifetime. On flush, the memtable is written to a data file on disk as a sequence of <key name><size of key data><index of columns/supercolumns><serialized column family> entries, along with a block index of <key name>, offset pairs (sampled entries such as K128, K256, K384 with their offsets) and a Bloom filter kept in memory.
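A toy sketch of this write path, assuming a JSON on-disk format and a tiny flush threshold purely for illustration (Cassandra's real commit log and SSTable formats are binary, as described above): append to the commit log, update the in-memory memtable, and flush it to a sorted, immutable file once the threshold is crossed.

```python
import json
import time

MEMTABLE_FLUSH_THRESHOLD = 4           # flush after this many rows (toy value)

commit_log = open("commitlog.txt", "a")
memtable = {}                          # key -> column data for this column family
sstable_count = 0

def write(key, columns):
    global sstable_count
    commit_log.write(json.dumps({"key": key, "columns": columns}) + "\n")
    commit_log.flush()                 # durability before acknowledging the write
    memtable[key] = columns
    if len(memtable) >= MEMTABLE_FLUSH_THRESHOLD:
        path = f"sstable-{sstable_count}.json"
        with open(path, "w") as f:     # keys written in sorted order, never updated
            for k in sorted(memtable):
                f.write(json.dumps({"key": k, "columns": memtable[k]}) + "\n")
        memtable.clear()
        sstable_count += 1

write("user:42", {"name": "Bob", "ts": time.time()})
```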
56. Storage Model - Compactions [Slide diagram: merge-sort compaction.] Several sorted data files (e.g. one with K1, K2, K3, ..., one with K2, K10, K30, ..., one with K4, K5, K10, ...) are combined with a MERGE SORT into a single sorted data file (K1, K2, K3, K4, K5, K10, K30, ...); entries marked DELETED are dropped. A new index file (K1 offset, K5 offset, K30 offset, ...) and a new Bloom filter are built and loaded in memory.
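The merge step can be sketched as follows; the (key, timestamp, value, deleted) tuples and the in-memory merge are simplifications of the real on-disk format, but the idea is the same: merge the sorted inputs, keep only the newest version of each key, and drop rows whose newest version is a tombstone.

```python
import heapq

def compact(sstables):
    """Each sstable is a list of (key, timestamp, value, deleted) sorted by key."""
    merged = {}
    for key, ts, value, deleted in heapq.merge(*sstables):
        current = merged.get(key)
        if current is None or ts > current[0]:
            merged[key] = (ts, value, deleted)
    # emit sorted output, skipping rows whose newest version is a tombstone
    return [(k, v) for k, (ts, v, deleted) in sorted(merged.items()) if not deleted]

old = [("K1", 1, "a", False), ("K2", 1, "b", False)]
new = [("K2", 2, "b2", False), ("K30", 2, None, True)]   # K30 was deleted
print(compact([old, new]))    # [('K1', 'a'), ('K2', 'b2')]
```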
64. System Architecture Content Overview: Partitioning, Replication, Membership & Failure Detection, Bootstrapping, Scaling the Cluster, Local Persistence, Communication
65. System Architecture Core Layer: Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication. Middle Layer: Commit log, Memtable, SSTable, Indexes, Compaction. Top Layer: Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools.
73. System Architecture The architecture of a storage system needs to have the following characteristics: scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management.
74. System Architecture We will focus on the core distributed systems techniques used in Cassandra: partitioning, replication, membership, failure handling, and scaling. All these modules work in synchrony to handle read/write requests.
75. System Architecture - Partitioning One of the key design features of Cassandra is the ability to scale incrementally. This requires the ability to dynamically partition the data over the set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing, but uses an order-preserving hash function to do so. The idea is that all the nodes, by their hash values, are located on a ring; the position of a node on the ring is randomly determined, and each node is responsible for replicating a range of the hash function's output space.
76. System Architecture – Partitioning (Ring Topology) [Slide diagram: nodes a, d, g, j on a ring, RF=3.] Conceptual Ring; one token per node; multiple ranges per node.
77. System Architecture – Partitioning (Ring Topology) [Slide diagram: nodes a, d, g, j on a ring, RF=2.] Conceptual Ring; one token per node; multiple ranges per node.
78. System Architecture – Partitioning (New Node) [Slide diagram: node m joining a ring of nodes a, d, g, j, RF=3.] Token assignment; range adjustment; bootstrap; an arrival only affects immediate neighbors.
79. System Architecture – Partitioning (Ring Partition) [Slide diagram: nodes a, d, g, j on a ring, RF=3, with one node down.] Node dies. Still available? Hinted handoff. Achtung! Plan for this.
83. System Architecture – Partitioning RandomPartitioner: uses MD5(key) to distribute data across nodes; even distribution of keys from one CF across ranges/nodes. OrderPreservingPartitioner: key distribution determined by token; lexicographical ordering; you can specify the token for this node to use; 'scrabble' distribution; required for range queries (scan over rows like a cursor in an index).
84. System Architecture – Partitioning - Token A Token is a partitioner-dependent element on the Ring. Each node has a single, unique Token. Each node claims a Range of the Ring from its Token to the Token of the previous node on the Ring.
85. System Architecture – Partitioning Map from Key Space to Token. RandomPartitioner: Tokens are integers in the range [0 .. 2^127]; MD5(Key) -> Token. Good: even key distribution. Bad: inefficient range queries. OrderPreservingPartitioner: Tokens are UTF8 strings in the range ["" .. ); Key -> Token. Good: efficient range queries. Bad: uneven key distribution.
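A small sketch contrasting the two partitioners; the token computations follow the slide (MD5 into the [0, 2^127] token space for RandomPartitioner, the raw key for OrderPreservingPartitioner), but the function names are mine, not Cassandra's API.

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    # MD5 gives an even spread of tokens, so load balances well,
    # but adjacent keys land on unrelated nodes (range scans are costly).
    return int(hashlib.md5(key).hexdigest(), 16) % (2 ** 127)

def order_preserving_token(key: bytes) -> bytes:
    # The key itself is the token: lexicographic order is preserved,
    # so range scans are cheap but hot key ranges create hotspots.
    return key

print(random_partitioner_token(b"alice"))
print(order_preserving_token(b"alice"))
```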
86. System Architecture – Snitching Map from Nodes to Physical Location. EndpointSnitch: guesses rack and datacenter based on the IP address octets. DataCenterEndpointSnitch: specify IP subnets for racks, grouped per datacenter. PropertySnitch: specify arbitrary mappings from individual IP addresses to racks and datacenters.
87. System Architecture - Replication Cassandra uses replication to achieve high availability and durability. Each data item is replicated at N hosts, where N is the replication factor configured "per-instance". Each key, k, is assigned to a coordinator node. The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 other nodes in the ring.
88. System Architecture – Placement Map from Token Space to Nodes. The first replica is always placed on the node that claims the range in which the token falls. Strategies determine where the rest of the replicas are placed. Cassandra provides the client with various options for how data needs to be replicated, offering replication policies such as: Rack Unaware, Rack Aware (within a datacenter), Datacenter Aware.
89. System Architecture - Replication Rack Unaware Place replicas on the N-1 subsequent nodes around the ring, ignoring topology. If an application chooses the "Rack Unaware" replication strategy, the non-coordinator replicas are chosen by picking the N-1 successors of the coordinator on the ring.
90. System Architecture - Replication Rack Aware (within a datacenter) Place the second replica in another datacenter, and the remaining N-2 replicas on nodes in other racks in the same datacenter.
91. System Architecture - Replication Datacenter Aware Place M of the N replicas in another datacenter, and the remaining N-M-1 replicas on nodes in other racks in the same datacenter.
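For the simplest of these strategies, Rack Unaware placement can be sketched as taking the coordinator plus its N-1 successors on the ring; the ring layout and node names below are invented for illustration.

```python
# Sketch of "Rack Unaware" placement: the coordinator owns the key's range and
# the remaining N-1 replicas are its successors on the ring, ignoring topology.
def rack_unaware_replicas(ring_nodes, coordinator_index, replication_factor):
    """ring_nodes is the ordered list of nodes around the ring."""
    return [ring_nodes[(coordinator_index + i) % len(ring_nodes)]
            for i in range(replication_factor)]

ring = ["a", "d", "g", "j"]
print(rack_unaware_replicas(ring, coordinator_index=1, replication_factor=3))
# ['d', 'g', 'j'] -> coordinator d plus its two successors
```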
93. System Architecture - Replication 1) Every node is aware of every other node in the system and hence the range each is responsible for; this knowledge is spread through gossiping (not via the leader). 2) A key is assigned to a node; that node is the key's coordinator and is responsible for replicating the item associated with the key on N-1 replicas in addition to itself. 3) Cassandra offers several replication policies and leaves it up to the application to choose one. These policies differ in the location of the selected replicas; Rack Aware, Rack Unaware, and Datacenter Aware are some of them. 4) Whenever a new node joins the system it contacts the leader of the Cassandra cluster, who tells the node the range for which it is responsible for replicating the associated keys. 5) Cassandra uses Zookeeper for maintaining the leader. 6) The nodes responsible for the same range are called the "preference list" for that range; this terminology is borrowed from Dynamo.
95. System Architecture - Replication Replication factor: how many nodes data is replicated on. Consistency level: Zero, One, Quorum, All; sync or async for writes; reliability of reads; read repair.
96. System Architecture – Replication (Leader) Cassandra elects a leader amongst its nodes using a system called Zookeeper. All nodes, on joining the cluster, contact the leader, who tells them for which ranges they are replicas; the leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring. The metadata about the ranges a node is responsible for is cached locally at each node and, in a fault-tolerant manner, inside Zookeeper; this way a node that crashes and comes back up knows what ranges it was responsible for. We borrow from Dynamo parlance and deem the nodes that are responsible for a given range the "preference list" for the range.
97. System Architecture - Membership Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism.
98. System Architecture - Failure handling Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down. In Cassandra, failure detection is also used to avoid attempts to communicate with unreachable nodes during various operations. Cassandra uses a modified version of the Accrual Failure Detector.
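A very rough sketch of the accrual idea, assuming exponentially distributed heartbeat intervals for simplicity (the real detector estimates the arrival distribution more carefully, and the threshold value here is arbitrary, not Cassandra's): instead of a binary up/down answer, the detector emits a suspicion level phi that grows the longer a heartbeat is overdue relative to past inter-arrival times.

```python
import math
import time

class AccrualFailureDetector:
    def __init__(self):
        self.intervals = []         # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self):
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = time.time() - self.last_heartbeat
        # assume exponentially distributed arrivals: phi = -log10(e^(-t/mean))
        return elapsed / (mean * math.log(10))

detector = AccrualFailureDetector()
detector.heartbeat(); time.sleep(0.1); detector.heartbeat()
print(detector.phi() < 8)   # suspect the node once phi crosses a chosen threshold
```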
99. System Architecture - Bootstrapping When a node starts for the first time, it chooses a random token for its position in the ring. In Cassandra, joins and leaves of nodes are initiated through an explicit mechanism rather than an automatic one. A node may be ordered to leave the network because of an observed malfunction, with the expectation that it will be back soon; if a node leaves the network forever, or a new node joins, data re-partitioning is required. A new node is frequently added because some existing nodes can no longer handle their load, so the new node is assigned part of the range that a heavily loaded node is currently responsible for. In that case data must be transferred between the two replicas, the old and the new one. This is usually done after the administrator issues a new join, and it does not shut the system down for the fraction of the range being transferred, since other replicas hold the same data. Once the data has been transferred to the new node, the older node no longer stores it.
100. System Architecture - Scaling When a new node is added into the system, it gets assigned a token such that it can alleviate a heavily loaded node
102. System Architecture - Local Persistence The Cassandra system relies on the local file system for data persistence.
103. System Architecture - Communication Control messages use UDP; application-related messages like read/write requests and replication requests are based on TCP.
105. Cassandra – Read/Write Tunable Consistency - per read/write. One: return once one replica responds success. Quorum: return once RF/2 + 1 replicas respond. All: return when all replicas respond. Want async replication? Write = ONE, Read = ONE (Performance++). Want strong consistency? Read = QUORUM, Write = QUORUM. Want strong consistency per datacenter? Read = LOCAL_QUORUM, Write = LOCAL_QUORUM.
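The arithmetic behind these levels can be sketched as follows; the R + W > RF check shows why QUORUM reads combined with QUORUM writes give strong consistency (this is a generic illustration, not Cassandra client code).

```python
# How many replicas must acknowledge at each consistency level, and the
# R + W > RF rule: read and write quorums are guaranteed to overlap.
def replicas_required(level: str, replication_factor: int) -> int:
    return {"ONE": 1,
            "QUORUM": replication_factor // 2 + 1,
            "ALL": replication_factor}[level]

RF = 3
r = replicas_required("QUORUM", RF)
w = replicas_required("QUORUM", RF)
print(r, w, r + w > RF)   # 2 2 True -> reads are guaranteed to see the latest write
```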
106. Cassandra – Read/Write When a read or write request reaches any node in the cluster, the state machine morphs through the following states: the nodes that replicate the data for the key are identified; the request is forwarded to all those nodes and the coordinator waits for the responses to arrive; if the replies do not arrive within a configured timeout value, the request fails and an error is returned to the client; if replies are received, the latest response is determined based on timestamp; replicas with old data are updated (a repair of the data is scheduled at any replica that does not have the latest piece of data).
107. Cassandra - Read Repair On every read, all replicas are queried. Only one replica's data is returned to the client. Checksums or timestamps from all replicas are compared. If any inconsistency is found, the full data is fetched from all replicas and merged, and the latest data is written back to the out-of-sync nodes.
108. Cassandra - Reads Practically lock free. SSTable proliferation. New in 0.6: row cache (avoids the SSTable lookup; not write-through) and key cache (avoids the index scan).
110. Read Query [Slide diagram: read path through a Cassandra cluster.] The client sends a read query to the cluster, which forwards it to the closest replica (Replica A) as a full read and to the other replicas (B and C) as digest queries. The result from the closest replica is returned to the client; if the digest responses differ, a read repair is performed.
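A sketch of the digest-compare-and-repair flow, with the replica state held in an in-memory dict purely for illustration: the closest replica returns data, the others are compared by digest, and any stale replica is overwritten with the newest-timestamped version.

```python
import hashlib

replicas = {
    "A": {"value": b"new", "ts": 2},
    "B": {"value": b"old", "ts": 1},   # stale replica
    "C": {"value": b"new", "ts": 2},
}

def digest(record):
    return hashlib.md5(record["value"]).hexdigest()

def read(closest="A"):
    data = replicas[closest]
    mismatched = [n for n in replicas
                  if n != closest and digest(replicas[n]) != digest(data)]
    if mismatched:
        # resolve to the newest version and repair the out-of-sync replicas
        newest = max(replicas.values(), key=lambda r: r["ts"])
        for n in mismatched:
            replicas[n] = dict(newest)
        data = newest
    return data["value"]

print(read())            # b'new'; replica B has been repaired
```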
111. Cassandra - Write No reads No seeks Sequential disk access Atomic within a column family Fast Any node Always writeable
112. Cassandra – Write (Properties) No locks in the critical path; sequential disk access; behaves like a write-back cache; append support without read-ahead; atomicity guarantee for a key; "always writable" (accepts writes during failure scenarios).
113. Cassandra - Writes Commit log for durability (configurable fsync); sequential writes only. Memtable: no disk access (no reads or seeks). SSTables are final (become read only): indexes, Bloom filter, raw data. Bottom line: FAST.
115. Cassandra - Write The system can be configured to perform either synchronous or asynchronous writes. For certain systems that require high throughput we rely on asynchronous replication; here the writes far exceed the reads that come into the system. In the synchronous case we wait for a quorum of responses before we return a result to the client.
117. Cassandra – Write (Fast) Fast writes: SEDA (staged event-driven architecture), a general-purpose framework for high concurrency and load conditioning. It decomposes applications into stages separated by queues and adopts a structured approach to event-driven concurrency.
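A minimal sketch of the SEDA idea, assuming two invented stages connected by queues and drained by their own threads, so a slow stage backs up its own queue instead of blocking the rest of the pipeline (this is a toy illustration, not Cassandra's staging).

```python
import queue
import threading
import time

parse_queue, write_queue = queue.Queue(), queue.Queue()

def parse_stage():
    # stage 1: pull raw requests off its queue, transform, hand to the next stage
    while True:
        request = parse_queue.get()
        write_queue.put(request.upper())

def write_stage():
    # stage 2: drains its own queue independently of stage 1's load
    while True:
        print("committed:", write_queue.get())

for stage in (parse_stage, write_stage):
    threading.Thread(target=stage, daemon=True).start()

parse_queue.put("insert row 1")
time.sleep(0.2)        # give the daemon stages time to drain before exiting
```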
127. Other - DHT DHTs (Distributed Hash Tables): a DHT is a class of decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems. http://en.wikipedia.org/wiki/Distributed_hash_table
132. Other - Bloom filter An example of a Bloom filter, representing the set { x , y , z }. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3. http://en.wikipedia.org/wiki/Bloom_filter
134. Other - Bloom filter A Bloom filter can be used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times, while Bloom filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without it; its use, however, does increase memory usage.
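A toy Bloom filter using the figure's parameters (m=18 bits, k=3 hash functions); deriving the k positions from salted MD5 hashes is just one convenient choice for the sketch, not how Cassandra derives its hashes. Membership tests may return false positives but never false negatives, so an SSTable is only read from disk when the filter says the key might be present.

```python
import hashlib

M, K = 18, 3                 # bit-array size and number of hash functions
bits = [0] * M

def _positions(key: str):
    for i in range(K):
        yield int(hashlib.md5(f"{i}:{key}".encode()).hexdigest(), 16) % M

def add(key: str):
    for p in _positions(key):
        bits[p] = 1

def might_contain(key: str) -> bool:
    return all(bits[p] for p in _positions(key))

for element in ("x", "y", "z"):
    add(element)
print(might_contain("x"), might_contain("w"))   # True, (almost always) False
```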
135. Other - Timestamps and Vector Clocks Eventual consistency relies on deciding what value a row will eventually converge to. In the case of two writers writing at "the same" time, this is difficult. Timestamps are one solution, but they rely on synchronized clocks and don't capture causality. Vector clocks are an alternative method of capturing order in a distributed system.
136. Other - Vector Clocks Definition: a vector clock is a tuple {T1, T2, ..., TN} of clock values, one from each node. V1 < V2 if: for all i, V1[i] <= V2[i], and for at least one i, V1[i] < V2[i]. V1 < V2 implies a global time ordering of events. When data is written from node i, it sets T[i] to its clock value. This allows eventual consistency to resolve the ordering of writes on multiple replicas.
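The comparison rule translates directly into code; this is a generic sketch of the definition above, not any particular database's implementation.

```python
# V1 < V2 iff every component of V1 is <= the corresponding component of V2
# and at least one component is strictly smaller; otherwise the two writes
# are concurrent and need an application-level merge.
def happens_before(v1, v2):
    return (all(a <= b for a, b in zip(v1, v2))
            and any(a < b for a, b in zip(v1, v2)))

v1, v2, v3 = (1, 0, 2), (2, 0, 2), (0, 1, 2)
print(happens_before(v1, v2))                          # True: v1 precedes v2
print(happens_before(v1, v3), happens_before(v3, v1))  # False, False: concurrent
```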
Every node in Dynamo is a member of the system. To make data forwarding between nodes faster (reducing transfer latency and improving response times), Amazon requires every member node to keep routing information about all other nodes. Because of machine failures or human actions, members join and leave the system all the time. To keep every node's membership view up to date, each node picks, at a fixed interval (one second), a random peer and communicates with it using a gossip-style mechanism [1]; if the connection succeeds, the two nodes exchange the membership information they hold, including data-placement and routing information.
The cluster is a logical storage ring. Node placement divides the ring into ranges that represent start/stop points for keys. Token assignment can be automatic or manual. Nodes placed closer together carry less responsibility and less data.