Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
ScyllaDB Tablets:
Rethinking Replication
Avi Kivity, CTO @ ScyllaDB
Avi Kivity
■ CTO, co-founder ScyllaDB
■ Linux KVM author and ex-maintainer
■ Goals
■ Short Demo
■ The problem we’re fixing
■ Implementation details
■ Deep dive into Demo
■ Status
Agenda
Tablet project goals
Goals
■ Fast bootstrap/decommission
■ Bootstrap is a critical time for a cluster if running out of space or CPU capacity
■ Incremental bootstrap
■ Shoulder the load immediately, not after bootstrap completes
■ Parallel bootstrap
■ Add multiple nodes in parallel if we’re in a real hurry
■ Decouple topology operations
■ Remove a dead node while bootstrapping two new nodes
■ Improve support for many small tables
Demo
■ Scenario
■ Preload a 3-node cluster with 650 GB/replica
■ Run a moderate mixed read/write workload
■ Bootstrap three nodes
■ Decommission three nodes
■ Test harness
■ Scylla-cluster-tests (open-source)
■ Used in weekly regression tests
Demo Scenario
Replica Writes
Replica Reads
History
How did we get here?
■ ScyllaDB streaming was fast… but
node storage grew faster
■ Some schema shapes slow down
mutation-based streaming
■ Eventually consistent, masterless
architecture means the operator has
to coordinate everything …
■ … and everything is serialized
■ Layout is static (famous Token Ring)
A
B
A
B
Token ring
For 2-node
cluster
After adding
one node
A
B
A
B
C
C
Implementation
■ Introduce a new layer of indirection -
the tablets table
■ Each table has its own token range to node
mapping
■ Mapping can change independently of node
addition and removal
■ A new node can be added without any owned
data!
■ Different tables can have different tablet counts
■ Tablet counts change so tablet size on disk
remains roughly constant
■ Managed by Raft Group 0
Implementation - metadata
System, tablets
Query
Replica
Set
Token
■ Each tablet replica is isolated into its
own memtable+sstables
■ Forms its own little Log-Structured Merge Tree
■ With compaction and stuff
■ Can be migrated as a unit
■ Migration: copy the unit
■ Cleanup: delete the unit
■ Split/merge as the table
grows/shrinks
Implementation - data path
■ Hosted on one node
■ But can be migrated freely if the node is down
■ Synchronized via Raft
■ Collects statistics on tables and
tablets
■ Migrates to balance space
■ Evacuates nodes to decommission
■ Migrates to balance CPU load
■ Rebuilds and repairs
Implementation - load balancer
■ Source of truth for the cluster
■ How many tablets for each table
■ Token boundaries for each tablets
■ On which nodes and shards do we have
replicas
■ What kind of transition the tablet is
undergoing
■ Which nodes and shards will host tablet
replicas after the transition
■ Managed using the Raft protocol
■ Replicated on every node
The tablets table
CREATE TABLE system.tablets (
table_id uuid,
last_token bigint,
keyspace_name text static,
replicas list<tuple<uuid, int>>,
new_replicas list<tuple<uuid, int>>,
session uuid,
stage text,
transition text,
table_name text static,
tablet_count int static,
PRIMARY KEY (table_id, last_token)
);
■ Drivers are tablet-aware
■ But without reading the tablets
table
■ Driver contacts a random node/shard
■ On miss, gets updated routing
information for that tablet
■ Ensures fast start-up even with 100,000
entries in the tablets table
Driver support
Deeper dive into demo
Replica Reads/Writes
Coordinator requests
CPU load (per shard)
Disk bandwidth - streaming
Disk bandwidth -query
Disk bandwidth - commitlog
Visualization tool
Status
■ What works
■ Parallel, incremental bootstrap and
decommission
■ File streaming
■ Replace and rebuild
■ Space load balancing
■ Split on table growth
■ Materialized Views
■ Still working on it
■ Merge on table shrinkage
■ Change Data Capture
■ CPU load balancing
■ Automatic migration from Vnodes
Status/plans
Stay in Touch
Avi Kivity
avi@scylladb.com
@AviKivity
@avikivity
https://www.linkedin.com/in/avikivity/
Graphics Sandbox
Illustration - request looks up into a table of tablets, exits lookup with (node,
shard) routing information
Where is
this tablet?
Node B
Shard 3
system.tablets
Query
Replica
Set
(detailed, vnodeS)
(detailed, tablets)
(detailed, vnodes, per-shard)
(detailed, tablets, per-shard)
Vnodes, advanced
Tablets, advanced
A node in which there are 16 equally sized slots for tablets, only some are
filled. In each filled slot there is a memtable + a few sstables
Memtable SSTable
SSTable
Memtable SSTable
SSTable
SSTable
Memtable SSTable
SSTable
SSTable
SSTable
SSTable
SSTable
SSTable
Memtable SSTable
SSTable
Memtable SSTable
Memtable SSTable
SSTable
SSTable
Memtable SSTable
Memtable SSTable
SSTable
SSTable
SSTable
Memtable SSTable
SSTable
Memtable SSTable
Illustration - maybe an octopus spreading its
tentacles to each node
Maybe use DB Performance at
Scale book style tentacles here?
Illustration - Deuteronomy style tablets but with host/shard
information instead of commandments
Felipe's replacement idea for deuteronomy thingie
CREATE TABLE system.tablets (
keyspace_name text static,
table_name text static,
tablet_count int static,
table_id uuid,
last_token bigint,
…
PRIMARY KEY(table_id, last_token)
)
Raft-Managed Table
User Table
Replicas

More Related Content

Similar to ScyllaDB Tablets: Rethinking Replication

W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in database
gafurov_x
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
Keeyong Han
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
How to be Successful with Scylla
How to be Successful with ScyllaHow to be Successful with Scylla
How to be Successful with Scylla
ScyllaDB
 
Introduction to ClustrixDB
Introduction to ClustrixDBIntroduction to ClustrixDB
Introduction to ClustrixDB
I Goo Lee
 
The Path to ScyllaDB 5.2
The Path to ScyllaDB 5.2The Path to ScyllaDB 5.2
The Path to ScyllaDB 5.2
ScyllaDB
 
To Serverless and Beyond
To Serverless and BeyondTo Serverless and Beyond
To Serverless and Beyond
ScyllaDB
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
Altinity Ltd
 
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016
Colin Charles
 
Presentation db2 best practices for optimal performance
Presentation   db2 best practices for optimal performancePresentation   db2 best practices for optimal performance
Presentation db2 best practices for optimal performance
solarisyougood
 
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
ScyllaDB
 
Mass storage systemsos
Mass storage systemsosMass storage systemsos
Mass storage systemsos
Gokila Manickam
 
Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions
Ceph Community
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
percona2013
 
Functional?
 Reactive? 
Why?
Functional?
 Reactive? 
Why?Functional?
 Reactive? 
Why?
Functional?
 Reactive? 
Why?
Timetrix
 
PNUTS: Yahoo!’s Hosted Data Serving Platform
PNUTS: Yahoo!’s Hosted Data Serving PlatformPNUTS: Yahoo!’s Hosted Data Serving Platform
PNUTS: Yahoo!’s Hosted Data Serving Platform
Tarik Reza Toha
 
OSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin CharlesOSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin Charles
NETWAYS
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
jaxconf
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
shinolajla
 

Similar to ScyllaDB Tablets: Rethinking Replication (20)

W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in database
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
How to be Successful with Scylla
How to be Successful with ScyllaHow to be Successful with Scylla
How to be Successful with Scylla
 
Introduction to ClustrixDB
Introduction to ClustrixDBIntroduction to ClustrixDB
Introduction to ClustrixDB
 
The Path to ScyllaDB 5.2
The Path to ScyllaDB 5.2The Path to ScyllaDB 5.2
The Path to ScyllaDB 5.2
 
To Serverless and Beyond
To Serverless and BeyondTo Serverless and Beyond
To Serverless and Beyond
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016
 
Presentation db2 best practices for optimal performance
Presentation   db2 best practices for optimal performancePresentation   db2 best practices for optimal performance
Presentation db2 best practices for optimal performance
 
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
 
Mass storage systemsos
Mass storage systemsosMass storage systemsos
Mass storage systemsos
 
Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions Reference Architecture: Architecting Ceph Storage Solutions
Reference Architecture: Architecting Ceph Storage Solutions
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
 
Functional?
 Reactive? 
Why?
Functional?
 Reactive? 
Why?Functional?
 Reactive? 
Why?
Functional?
 Reactive? 
Why?
 
PNUTS: Yahoo!’s Hosted Data Serving Platform
PNUTS: Yahoo!’s Hosted Data Serving PlatformPNUTS: Yahoo!’s Hosted Data Serving Platform
PNUTS: Yahoo!’s Hosted Data Serving Platform
 
OSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin CharlesOSDC 2016 - Tuning Linux for your Database by Colin Charles
OSDC 2016 - Tuning Linux for your Database by Colin Charles
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
 

More from ScyllaDB

Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
ScyllaDB
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
ScyllaDB
 
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
ScyllaDB
 
Noise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, AkamaiNoise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, Akamai
ScyllaDB
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
ScyllaDB
 
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
ScyllaDB
 
Performance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy EvertsPerformance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy Everts
ScyllaDB
 
Using Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance TroublesUsing Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance Troubles
ScyllaDB
 
Reducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGCReducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGC
ScyllaDB
 
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
ScyllaDB
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
ScyllaDB
 
Conquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB DriversConquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB Drivers
ScyllaDB
 
Interaction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance MetricInteraction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance Metric
ScyllaDB
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
ScyllaDB
 
99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz
ScyllaDB
 
Square's Lessons Learned from Implementing a Key-Value Store with Raft
Square's Lessons Learned from Implementing a Key-Value Store with RaftSquare's Lessons Learned from Implementing a Key-Value Store with Raft
Square's Lessons Learned from Implementing a Key-Value Store with Raft
ScyllaDB
 
Making Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of RustMaking Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of Rust
ScyllaDB
 
A Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus AlbuquerqueA Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus Albuquerque
ScyllaDB
 
The Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of LatencyThe Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of Latency
ScyllaDB
 

More from ScyllaDB (20)

Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
 
Noise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, AkamaiNoise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, Akamai
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
 
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
 
Performance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy EvertsPerformance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy Everts
 
Using Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance TroublesUsing Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance Troubles
 
Reducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGCReducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGC
 
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
 
Conquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB DriversConquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB Drivers
 
Interaction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance MetricInteraction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance Metric
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
 
99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz
 
Square's Lessons Learned from Implementing a Key-Value Store with Raft
Square's Lessons Learned from Implementing a Key-Value Store with RaftSquare's Lessons Learned from Implementing a Key-Value Store with Raft
Square's Lessons Learned from Implementing a Key-Value Store with Raft
 
Making Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of RustMaking Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of Rust
 
A Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus AlbuquerqueA Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus Albuquerque
 
The Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of LatencyThe Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of Latency
 

Recently uploaded

Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
Aurora Consulting
 
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum ThreatsNavigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats
anupriti
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
Eric D. Schabell
 
5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx
SATYENDRA100
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
Matthew Sinclair
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
Lessons Of Binary Analysis - Christien Rioux
Lessons Of Binary Analysis - Christien RiouxLessons Of Binary Analysis - Christien Rioux
Lessons Of Binary Analysis - Christien Rioux
crioux1
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024
The Digital Insurer
 
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating AppsecGDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
James Anderson
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
amitchopra0215
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)
Alpen-Adria-Universität
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
Linda Zhang
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
Larry Smarr
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 

Recently uploaded (20)

Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
 
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum ThreatsNavigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
Lessons Of Binary Analysis - Christien Rioux
Lessons Of Binary Analysis - Christien RiouxLessons Of Binary Analysis - Christien Rioux
Lessons Of Binary Analysis - Christien Rioux
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024
 
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating AppsecGDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 

ScyllaDB Tablets: Rethinking Replication

  • 2. Avi Kivity ■ CTO, co-founder ScyllaDB ■ Linux KVM author and ex-maintainer
  • 3. ■ Goals ■ Short Demo ■ The problem we’re fixing ■ Implementation details ■ Deep dive into Demo ■ Status Agenda
  • 5. Goals ■ Fast bootstrap/decommission ■ Bootstrap is a critical time for a cluster if running out of space or CPU capacity ■ Incremental bootstrap ■ Shoulder the load immediately, not after bootstrap completes ■ Parallel bootstrap ■ Add multiple nodes in parallel if we’re in a real hurry ■ Decouple topology operations ■ Remove a dead node while bootstrapping two new nodes ■ Improve support for many small tables
  • 7. ■ Scenario ■ Preload a 3-node cluster with 650 GB/replica ■ Run a moderate mixed read/write workload ■ Bootstrap three nodes ■ Decommission three nodes ■ Test harness ■ Scylla-cluster-tests (open-source) ■ Used in weekly regression tests Demo Scenario
  • 11. How did we get here? ■ ScyllaDB streaming was fast… but node storage grew faster ■ Some schema shapes slow down mutation-based streaming ■ Eventually consistent, masterless architecture means the operator has to coordinate everything … ■ … and everything is serialized ■ Layout is static (famous Token Ring) A B A B Token ring For 2-node cluster After adding one node A B A B C C
  • 13. ■ Introduce a new layer of indirection - the tablets table ■ Each table has its own token range to node mapping ■ Mapping can change independently of node addition and removal ■ A new node can be added without any owned data! ■ Different tables can have different tablet counts ■ Tablet counts change so tablet size on disk remains roughly constant ■ Managed by Raft Group 0 Implementation - metadata System, tablets Query Replica Set Token
  • 14. ■ Each tablet replica is isolated into its own memtable+sstables ■ Forms its own little Log-Structured Merge Tree ■ With compaction and stuff ■ Can be migrated as a unit ■ Migration: copy the unit ■ Cleanup: delete the unit ■ Split/merge as the table grows/shrinks Implementation - data path
  • 15. ■ Hosted on one node ■ But can be migrated freely if the node is down ■ Synchronized via Raft ■ Collects statistics on tables and tablets ■ Migrates to balance space ■ Evacuates nodes to decommission ■ Migrates to balance CPU load ■ Rebuilds and repairs Implementation - load balancer
  • 16. ■ Source of truth for the cluster ■ How many tablets for each table ■ Token boundaries for each tablets ■ On which nodes and shards do we have replicas ■ What kind of transition the tablet is undergoing ■ Which nodes and shards will host tablet replicas after the transition ■ Managed using the Raft protocol ■ Replicated on every node The tablets table CREATE TABLE system.tablets ( table_id uuid, last_token bigint, keyspace_name text static, replicas list<tuple<uuid, int>>, new_replicas list<tuple<uuid, int>>, session uuid, stage text, transition text, table_name text static, tablet_count int static, PRIMARY KEY (table_id, last_token) );
  • 17. ■ Drivers are tablet-aware ■ But without reading the tablets table ■ Driver contacts a random node/shard ■ On miss, gets updated routing information for that tablet ■ Ensures fast start-up even with 100,000 entries in the tablets table Driver support
  • 21. CPU load (per shard)
  • 22. Disk bandwidth - streaming
  • 24. Disk bandwidth - commitlog
  • 27. ■ What works ■ Parallel, incremental bootstrap and decommission ■ File streaming ■ Replace and rebuild ■ Space load balancing ■ Split on table growth ■ Materialized Views ■ Still working on it ■ Merge on table shrinkage ■ Change Data Capture ■ CPU load balancing ■ Automatic migration from Vnodes Status/plans
  • 28. Stay in Touch Avi Kivity avi@scylladb.com @AviKivity @avikivity https://www.linkedin.com/in/avikivity/
  • 30. Illustration - request looks up into a table of tablets, exits lookup with (node, shard) routing information Where is this tablet? Node B Shard 3 system.tablets Query Replica Set
  • 37. A node in which there are 16 equally sized slots for tablets, only some are filled. In each filled slot there is a memtable + a few sstables Memtable SSTable SSTable Memtable SSTable SSTable SSTable Memtable SSTable SSTable SSTable SSTable SSTable SSTable SSTable Memtable SSTable SSTable Memtable SSTable Memtable SSTable SSTable SSTable Memtable SSTable Memtable SSTable SSTable SSTable SSTable Memtable SSTable SSTable Memtable SSTable
  • 38. Illustration - maybe an octopus spreading its tentacles to each node Maybe use DB Performance at Scale book style tentacles here?
  • 39. Illustration - Deuteronomy style tablets but with host/shard information instead of commandments
  • 40. Felipe's replacement idea for deuteronomy thingie CREATE TABLE system.tablets ( keyspace_name text static, table_name text static, tablet_count int static, table_id uuid, last_token bigint, … PRIMARY KEY(table_id, last_token) ) Raft-Managed Table User Table Replicas