
Distributed Systems

The second half of Concurrent and Distributed Systems


https://www.cl.cam.ac.uk/teaching/current/ConcDisSys

Dr. Martin Kleppmann (mk428@cam)


University of Cambridge
Computer Science Tripos, Part IB

This work is published under a Creative Commons BY-SA license.
A distributed system is. . .

I “. . . a system in which the failure of a computer you


didn’t even know existed can render your own computer
unusable.” — Leslie Lamport

I . . . multiple computers communicating via a network. . .


I . . . trying to achieve some task together
I Consists of “nodes” (computer, phone, car, robot, . . . )
Recommended reading

I van Steen & Tanenbaum.


“Distributed Systems”
(any ed), free ebook available
I Cachin, Guerraoui & Rodrigues.
“Introduction to Reliable and Secure Distributed
Programming” (2nd ed), Springer 2011
I Kleppmann.
“Designing Data-Intensive Applications”,
O’Reilly 2017
I Bacon & Harris.
“Operating Systems: Concurrent and Distributed
Software Design”, Addison-Wesley 2003
Relationships with other courses
I Concurrent Systems – Part IB
(every distributed system is also concurrent)
I Operating Systems – Part IA
(inter-process communication, scheduling)
I Databases – Part IA
(many modern databases are distributed)
I Computer Networking – Part IB Lent term
(distributed systems involve network communication)
I Further Java – Part IB Michaelmas
(distributed programming practical exercises)
I Security – Part IB Easter term
(network protocols with encryption & authentication)
I Cloud Computing – Part II
(distributed systems for processing large amounts of data)
Why make a system distributed?

I It’s inherently distributed:


e.g. sending a message from your mobile phone to your
friend’s phone
I For better reliability:
even if one node fails, the system as a whole keeps
functioning
I For better performance:
get data from a nearby node rather than one halfway
round the world
I To solve bigger problems:
e.g. huge amounts of data, can’t fit on one machine
Why NOT make a system distributed?

The trouble with distributed systems:


I Communication may fail (and we might not even know it
has failed).
I Processes may crash (and we might not know).
I All of this may happen nondeterministically.

Fault tolerance: we want the system as a whole to continue


working, even when some parts are faulty.

This is hard.

Writing a program to run on a single computer is


comparatively easy?!
Distributed Systems and Computer Networking

We use a simple abstraction of communication:

message m
node i node j

Reality is much more complex:


I Various network operators:
eduroam, home DSL, cellular data, coffee shop wifi,
submarine cable, satellite. . .

I Physical communication:
electric current, radio waves, laser, hard drives in a van. . .
Hard drives in a van?!

https://docs.aws.amazon.com/snowball/latest/ug/using-device.html

High latency, high bandwidth!


Latency and bandwidth

Latency: time until message arrives


I In the same building/datacenter: ≈ 1 ms
I One continent to another: ≈ 100 ms
I Hard drives in a van: ≈ 1 day

Bandwidth: data volume per unit time


I 3G cellular data: ≈ 1 Mbit/s
I Home broadband: ≈ 10 Mbit/s
I Hard drives in a van: 50 TB/box ≈ 1 Gbit/s

(Very rough numbers, vary hugely in practice!)
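A back-of-the-envelope sketch (in Java, using only the rough numbers above) of why shipping drives can beat the network on throughput despite the huge latency:

// Back-of-the-envelope comparison using the rough numbers above:
// how long does it take to move one 50 TB box of data?
public class TransferTime {
    public static void main(String[] args) {
        double bits = 50e12 * 8;                     // 50 TB expressed in bits
        double broadband = 10e6;                     // home broadband, ~10 Mbit/s
        double days = bits / broadband / 86400;
        System.out.printf("Over 10 Mbit/s broadband: about %.0f days%n", days);
        // ~460 days over the network, versus roughly 1 day of latency for
        // the same box driven over in a van: high latency, but high bandwidth.
    }
}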


Client-server example: the web

Time flows from top to bottom.

client                                server www.cst.cam.ac.uk

request message:   GET /teaching/2021/ConcDisSys
response message:  <!DOCTYPE html>...
Client-server example: online payments

online shop                           payments service

request message:   charge £3.99 to credit card 1234. . .
response message:  success
Remote Procedure Call (RPC) example

// Online shop handling customer's card details

Card card = new Card();
card.setCardNumber("1234 5678 8765 4321");
card.setExpiryDate("10/2024");
card.setCVC("123");

Result result = paymentsService.processPayment(card, 3.99, Currency.GBP);

if (result.isSuccess()) {
    fulfilOrder();
}

Implementation of this function is on another node!


online shop | RPC client | RPC server | payment service

The online shop calls the processPayment() stub and waits.
The RPC client marshals the arguments and sends message m1.
The RPC server unmarshals the arguments and invokes the
processPayment() implementation in the payment service.
The RPC server marshals the result and sends message m2.
The RPC client unmarshals the result and the stub call returns.

m1 = {
  "request": "processPayment",
  "card": {
    "number": "1234567887654321",
    "expiryDate": "10/2024",
    "CVC": "123"
  },
  "amount": 3.99,
  "currency": "GBP"
}

m2 = {
  "result": "success",
  "id": "XP61hHw2Rvo"
}
Remote Procedure Call (RPC)

Ideally, RPC makes a call to a remote function look the same


as a local function call.

“Location transparency”:
system hides where a resource is located.

In practice. . .
I what if the service crashes during the function call?
I what if a message is lost?
I what if a message is delayed?
I if something goes wrong, is it safe to retry?
RPC history

I SunRPC/ONC RPC (1980s, basis for NFS)


I CORBA: object-oriented middleware, hot in the 1990s
I Microsoft’s DCOM and Java RMI (similar to CORBA)
I SOAP/XML-RPC: RPC using XML and HTTP (1998)
I Thrift (Facebook, 2007)
I gRPC (Google, 2015)
I REST (often with JSON)
I Ajax in web browsers
RPC/REST in JavaScript
let args = {amount: 3.99, currency: 'GBP', /*...*/ };
let request = {
method: 'POST',
body: JSON.stringify(args),
headers: {'Content-Type': 'application/json'}
};

fetch('https://example.com/payments', request)
.then((response) => {
if (response.ok) success(response.json());
else failure(response.status); // server error
})
.catch((error) => {
failure(error); // network error
});
RPC in enterprise systems

“Service-oriented architecture” (SOA) / “microservices”:

splitting a large software application into multiple services


(on multiple nodes) that communicate via RPC.

Different services implemented in different languages:


I interoperability: datatype conversions
I Interface Definition Language (IDL):
language-independent API specification
gRPC IDL example
message PaymentRequest {
  message Card {
    required string cardNumber = 1;
    optional int32 expiryMonth = 2;
    optional int32 expiryYear = 3;
    optional int32 CVC = 4;
  }
  enum Currency { GBP = 1; USD = 2; }

  required Card card = 1;
  required int64 amount = 2;
  required Currency currency = 3;
}

message PaymentStatus {
  required bool success = 1;
  optional string errorMessage = 2;
}

service PaymentService {
  rpc ProcessPayment(PaymentRequest) returns (PaymentStatus) {}
}
Lecture 2

Models of distributed systems


The two generals problem
city
attack? attack?

army 1 army 2
messengers

army 1 army 2 outcome


does not attack does not attack nothing happens
attacks does not attack army 1 defeated
does not attack attacks army 2 defeated
attacks attacks city captured

Desired: army 1 attacks if and only if army 2 attacks


The two generals problem

general 1                              general 2
    ---- “attack 10 Nov, okay?” ---->
    <--- “10 Nov agreed!” ----------

From general 1’s point of view, this is indistinguishable from:

general 1                              general 2
    ---- “attack 10 Nov, okay?” ---->
    (no response arrives)
How should the generals decide?
1. General 1 always attacks, even if no response is received?
I Send lots of messengers to increase probability that one
will get through
I If all are captured, general 2 does not know about the
attack, so general 1 loses

2. General 1 only attacks if positive response from general 2


is received?
I Now general 1 is safe
I But general 2 knows that general 1 will only attack if
general 2’s response gets through
I Now general 2 is in the same situation as general 1 in
option 1
No common knowledge: the only way of knowing
something is to communicate it
The two generals problem applied

customer
dispatch goods charge credit card

online shop payments service


RPC

online shop payments service outcome


does not dispatch does not charge nothing happens
dispatches does not charge shop loses money
does not dispatch charges customer complaint
dispatches charges everyone happy

Desired: online shop dispatches if and only if payment made


The Byzantine generals problem
army 3

attack?

messengers messengers
city

attack? attack?

army 1 army 2
messengers

Problem: some of the generals might be traitors


Generals that might lie

general 1                general 2                general 3

general 1 to general 2:  “attack!”
general 1 to general 3:  “attack!”
general 2 to general 3:  “general 1 said retreat!”

From general 3’s point of view, this is indistinguishable from:

general 1 to general 2:  “retreat!”
general 1 to general 3:  “attack!”
general 2 to general 3:  “general 1 said retreat!”
The Byzantine generals problem

I Up to f generals might behave maliciously


I Honest generals don’t know who the malicious ones are
I The malicious generals may collude
I Nevertheless, honest generals must agree on plan

I Theorem: need 3f + 1 generals in total to tolerate f
  malicious generals (i.e. fewer than 1/3 may be malicious);
  for example, tolerating f = 1 traitor requires at least 4 generals
I Cryptography (digital signatures) helps – but problem
  remains hard
Trust relationships and malicious behaviour

[Diagram: the customer places an order with the online shop via RPC, and
the online shop calls the payments service via RPC; customer, online shop,
and payments service each need to agree with one another.]

Who can trust whom?
The Byzantine empire (650 CE)
Byzantium/Constantinople/Istanbul

Source: https://commons.wikimedia.org/wiki/File:Byzantiumby650AD.svg

“Byzantine” has long been used for “excessively complicated,


bureaucratic, devious” (e.g. “the Byzantine tax law”)
System models

We have seen two thought experiments:


I Two generals problem: a model of networks
I Byzantine generals problem: a model of node behaviour
In real systems, both nodes and networks may be faulty!

Capture assumptions in a system model consisting of:


I Network behaviour (e.g. message loss)
I Node behaviour (e.g. crashes)
I Timing behaviour (e.g. latency)
Choice of models for each of these parts.
Networks are unreliable

In the sea, sharks bite fibre optic cables


https://slate.com/technology/2014/08/
shark-attacks-threaten-google-s-undersea-internet-cables-video.html

On land, cows step on the cables


https://twitter.com/uhoelzle/status/1263333283107991558
System model: network behaviour
Assume bidirectional point-to-point communication between
two nodes, with one of:
I Reliable (perfect) links:
  A message is received if and only if it is sent.
  Messages may be reordered.
I Fair-loss links:
  Messages may be lost, duplicated, or reordered.
  If you keep retrying, a message eventually gets through.
  (Upgrade to reliable links: retry + deduplication.)
I Arbitrary links (active adversary):
  A malicious adversary may interfere with messages
  (eavesdrop, modify, drop, spoof, replay).
  (Upgrade to fair-loss links: TLS.)

Network partition: some links dropping/delaying all
messages for extended period of time
System model: node behaviour
Each node executes a specified algorithm,
assuming one of the following:
I Crash-stop (fail-stop):
A node is faulty if it crashes (at any moment).
After crashing, it stops executing forever.
I Crash-recovery (fail-recovery):
A node may crash at any moment, losing its in-memory
state. It may resume executing sometime later.
I Byzantine (fail-arbitrary):
A node is faulty if it deviates from the algorithm.
Faulty nodes may do anything, including crashing or
malicious behaviour.

A node that is not faulty is called “correct”


System model: synchrony (timing) assumptions
Assume one of the following for network and nodes:
I Synchronous:
Message latency no greater than a known upper bound.
Nodes execute algorithm at a known speed.
I Partially synchronous:
The system is asynchronous for some finite (but
unknown) periods of time, synchronous otherwise.
I Asynchronous:
Messages can be delayed arbitrarily.
Nodes can pause execution arbitrarily.
No timing guarantees at all.

Note: other parts of computer science use the terms


“synchronous” and “asynchronous” differently.
Violations of synchrony in practice
Networks usually have quite predictable latency, which can
occasionally increase:
I Message loss requiring retry
I Congestion/contention causing queueing
I Network/route reconfiguration

Nodes usually execute code at a predictable speed, with


occasional pauses:
I Operating system scheduling issues, e.g. priority inversion
I Stop-the-world garbage collection pauses
I Page faults, swap, thrashing
Real-time operating systems (RTOS) provide scheduling
guarantees, but most distributed systems do not use RTOS
System models summary

For each of the three parts, pick one:

I Network:
reliable, fair-loss, or arbitrary

I Nodes:
crash-stop, crash-recovery, or Byzantine

I Timing:
synchronous, partially synchronous, or asynchronous

This is the basis for any distributed algorithm.


If your assumptions are wrong, all bets are off!
Availability
Online shop wants to sell stuff 24/7!
Service unavailability = downtime = losing money

Availability = uptime = fraction of time that a service is


functioning correctly
I “Two nines” = 99% up = down 3.7 days/year
I “Three nines” = 99.9% up = down 8.8 hours/year
I “Four nines” = 99.99% up = down 53 minutes/year
I “Five nines” = 99.999% up = down 5.3 minutes/year

Service-Level Objective (SLO):


e.g. “99.9% of requests in a day get a response in 200 ms”

Service-Level Agreement (SLA):


contract specifying some SLO, penalties for violation
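The downtime figures above are just arithmetic on the availability fraction; a minimal Java sketch of the calculation (assuming 365.25 days per year):

// Turning "number of nines" into downtime per year.
public class Downtime {
    public static void main(String[] args) {
        double minutesPerYear = 365.25 * 24 * 60;   // assumes 365.25 days/year
        double[] nines = {0.99, 0.999, 0.9999, 0.99999};
        for (double availability : nines) {
            double downMinutes = (1 - availability) * minutesPerYear;
            System.out.printf("%7.3f%% up -> %8.1f minutes down per year%n",
                    availability * 100, downMinutes);
        }
        // 99%: ~5260 min (~3.7 days); 99.9%: ~526 min (~8.8 h);
        // 99.99%: ~53 min; 99.999%: ~5.3 min.
    }
}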
Achieving high availability: fault tolerance

Failure: system as a whole isn’t working

Fault: some part of the system isn’t working


I Node fault: crash (crash-stop/crash-recovery),
deviating from algorithm (Byzantine)
I Network fault: dropping or significantly delaying messages

Fault tolerance:
system as a whole continues working, despite faults
(some maximum number of faults assumed)

Single point of failure (SPOF):


node/network link whose fault leads to failure
Failure detectors
Failure detector:
algorithm that detects whether another node is faulty

Perfect failure detector:


labels a node as faulty if and only if it has crashed

Typical implementation for crash-stop/crash-recovery:


send message, await response, label node as crashed if no
reply within some timeout

Problem:
cannot tell the difference between crashed node, temporarily
unresponsive node, lost message, and delayed message
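A minimal sketch of such a timeout-based detector in Java; sendPing is an assumed hook onto the network, and the node name and 5-second timeout are illustrative. Note it inherits exactly the problem above: a timeout cannot tell a crashed node from a slow one.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Timeout-based failure detection: ping the node and suspect it as
// crashed if no reply arrives within the timeout.
public class TimeoutFailureDetector {
    private static final long TIMEOUT_MILLIS = 5000;   // illustrative value

    boolean suspectsCrashed(String node) {
        CompletableFuture<Void> reply = sendPing(node);
        try {
            reply.get(TIMEOUT_MILLIS, TimeUnit.MILLISECONDS);
            return false;            // reply arrived in time: assume the node is alive
        } catch (TimeoutException e) {
            return true;             // no reply within the timeout: suspect a crash
        } catch (InterruptedException | ExecutionException e) {
            return true;             // treat other errors as suspicion as well
        }
    }

    // Placeholder: send a ping message and complete the future on reply.
    CompletableFuture<Void> sendPing(String node) {
        return new CompletableFuture<>();
    }
}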
Failure detection in partially synchronous systems

Perfect timeout-based failure detector exists only in a


synchronous crash-stop system with reliable links.

Eventually perfect failure detector:


I May temporarily label a node as crashed,
even though it is correct
I May temporarily label a node as correct,
even though it has crashed
I But eventually, labels a node as crashed
if and only if it has crashed

Reflects fact that detection is not instantaneous, and we may


have spurious timeouts
Lecture 3

Time, clocks, and ordering of events


A detective story

In the night from 30 June to 1 July 2012 (UK time), many


online services and systems around the world crashed
simultaneously.

Servers locked up and stopped responding.

Some airlines could not process any reservations or check-ins


for several hours.

What happened?
Clocks and time in distributed systems
Distributed systems often need to measure time, e.g.:
I Schedulers, timeouts, failure detectors, retry timers
I Performance measurements, statistics, profiling
I Log files & databases: record when an event occurred
I Data with time-limited validity (e.g. cache entries)
I Determining order of events across several nodes

We distinguish two types of clock:


I physical clocks: count number of seconds elapsed
I logical clocks: count events, e.g. messages sent

NB. Clock in digital electronics (oscillator)


≠ clock in distributed systems (source of timestamps)
Quartz clocks

I Quartz crystal
laser-trimmed to
mechanically resonate at a
specific frequency
I Piezoelectric effect:
mechanical force ⇔
electric field
I Oscillator circuit produces
signal at resonant
frequency
I Count number of cycles to
measure elapsed time
Quartz clock error: drift
I One clock runs slightly fast, another slightly slow
I Drift measured in parts per million (ppm)
I 1 ppm = 1 microsecond/second = 86 ms/day = 32 s/year
I Most computer clocks correct within ≈ 50 ppm

Temperature
significantly
affects drift
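A quick worked example of the drift figures above (a sketch assuming a clock at the quoted 50 ppm worst case):

// How far a quartz clock can drift at a given error rate in ppm.
public class ClockDrift {
    public static void main(String[] args) {
        double ppm = 50;                              // assumed drift rate
        double secondsPerDay = ppm * 1e-6 * 86400;
        double secondsPerYear = secondsPerDay * 365.25;
        System.out.printf("%.0f ppm -> %.2f s/day, %.0f s/year%n",
                ppm, secondsPerDay, secondsPerYear);
        // 50 ppm works out to about 4.3 s/day, i.e. roughly 26 minutes/year.
    }
}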
Atomic clocks
I Caesium-133 has a resonance (“hyperfine
  transition”) at ≈ 9 GHz
I Tune an electronic oscillator to that resonant
  frequency
I 1 second = 9,192,631,770 periods of that signal
I Accuracy ≈ 1 part in 10^14
  (1 second in 3 million years)
I Price ≈ £20,000 (?)
  (can get cheaper rubidium clocks for ≈ £1,000)

https://www.microsemi.com/product-directory/cesium-frequency-references/4115-5071a-cesium-primary-frequency-standard
GPS as time source

I 31 satellites, each carrying an atomic clock
I satellite broadcasts current time and location
I calculate position from speed-of-light delay
  between satellite and receiver
I corrections for atmospheric effects, relativity, etc.
I in datacenters, need antenna on the roof

https://commons.wikimedia.org/wiki/File:Gps-atmospheric-efects.png
Coordinated Universal Time (UTC)
Greenwich Mean Time (GMT, solar
time): it’s noon when the sun is in the
south, as seen from the Greenwich meridian

International Atomic Time (TAI): 1 day


is 24 × 60 × 60 × 9,192,631,770 periods of
caesium-133’s resonant frequency

Problem: speed of Earth’s rotation is not


constant

Compromise: UTC is TAI with corrections


to account for Earth rotation

Time zones and daylight savings time


are offsets to UTC
Leap seconds
Every year, on 30 June and 31 December at 23:59:59 UTC,
one of three things happens:
I The clock immediately jumps forward to 00:00:00,
skipping one second (negative leap second)
I The clock moves to 00:00:00 after one second, as usual
I The clock moves to 23:59:60 after one second, and then
moves to 00:00:00 after one further second
(positive leap second)
This is announced several months beforehand.

http://leapsecond.com/notes/leap-watch.htm
How computers represent timestamps

Two most common representations:


I Unix time: number of seconds since 1 January 1970
00:00:00 UTC (the “epoch”), not counting leap seconds
I ISO 8601: year, month, day, hour, minute, second, and
timezone offset relative to UTC
example: 2020-11-09T09:50:17+00:00

Conversion between the two requires:


I Gregorian calendar: 365 days in a year, except leap years
(year % 4 == 0 && (year % 100 != 0 ||
year % 400 == 0))
I Knowledge of past and future leap seconds. . . ?!
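As a small illustration of the two representations, a java.time sketch (java.time, like Unix time, ignores leap seconds; the epoch value is the one corresponding to the example timestamp above):

import java.time.Instant;

// Converting between Unix time and an ISO 8601 string.
public class Timestamps {
    public static void main(String[] args) {
        long unixTime = 1604915417L;                 // seconds since the 1970 epoch
        Instant t = Instant.ofEpochSecond(unixTime);
        System.out.println(t);                       // 2020-11-09T09:50:17Z (ISO 8601, UTC)
        System.out.println(t.getEpochSecond());      // 1604915417 (back to Unix time)
    }
}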
How most software deals with leap seconds

By ignoring them!

However, OS and DistSys often need


timings with sub-second accuracy.

30 June 2012: bug in Linux kernel caused


livelock on leap second, causing many
Internet services to go down

Pragmatic solution: “smear” (spread out)


the leap second over the course of a day
https://www.flickr.com/photos/ru_boff/37915499055/
Clock synchronisation

Computers track physical time/UTC with a quartz clock


(with battery, continues running when power is off)

Due to clock drift, clock error gradually increases

Clock skew: difference between two clocks at a point in time

Solution: Periodically get the current time from a server that


has a more accurate time source (atomic clock or GPS
receiver)

Protocols: Network Time Protocol (NTP),


Precision Time Protocol (PTP)
Network Time Protocol (NTP)
Many operating system vendors run NTP servers,
configure OS to use them by default

Hierarchy of clock servers arranged into strata:


I Stratum 0: atomic clock or GPS receiver
I Stratum 1: synced directly with stratum 0 device
I Stratum 2: servers that sync with stratum 1, etc.

May contact multiple servers, discard outliers, average rest

Makes multiple requests to the same server, use statistics to


reduce random error due to variations in network latency

Reduces clock skew to a few milliseconds in good network


conditions, but can be much worse!
Estimating time over a network
NTP client                                  NTP server

t1  ---- request: t1 ---------------->  t2
t4  <--- response: (t1, t2, t3) ------  t3

Round-trip network delay: δ = (t4 − t1) − (t3 − t2)

Estimated server time when client receives response: t3 + δ/2

Estimated clock skew: θ = t3 + δ/2 − t4 = (t2 − t1 + t3 − t4)/2
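A worked example of these formulas (a Java sketch with illustrative timestamps, all in milliseconds; t1 and t4 are read from the client's clock, t2 and t3 from the server's):

// NTP-style delay and skew estimation from the four timestamps.
public class NtpEstimate {
    public static void main(String[] args) {
        long t1 = 1000, t2 = 1120, t3 = 1130, t4 = 1070;

        long delta = (t4 - t1) - (t3 - t2);          // round-trip network delay
        double theta = (t2 - t1 + t3 - t4) / 2.0;    // estimated clock skew

        System.out.println("delay delta = " + delta + " ms");   // 60 ms
        System.out.println("skew  theta = " + theta + " ms");   // 90.0 ms (client behind server)
    }
}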
Correcting clock skew
Once the client has estimated the clock skew θ, it needs to
apply that correction to its clock.

I If |θ| < 125 ms, slew the clock:


slightly speed it up or slow it down by up to 500 ppm
(brings clocks in sync within ≈ 5 minutes)

I If 125 ms ≤ |θ| < 1,000 s, step the clock:


suddenly reset client clock to estimated server timestamp

I If |θ| ≥ 1,000 s, panic and do nothing


(leave the problem for a human operator to resolve)

Systems that rely on clock sync need to monitor clock skew!


http://www.ntp.org/ntpfaq/NTP-s-algo.htm
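A minimal sketch of that correction policy in Java; the thresholds are the ones quoted above, and slewClock/stepClock are placeholder hooks for the actual clock adjustment:

// Decide how to apply an estimated skew theta (in seconds) to the local clock.
public class SkewCorrection {
    private static final double SLEW_LIMIT = 0.125;  // 125 ms
    private static final double STEP_LIMIT = 1000;   // 1,000 s

    static void correctClock(double thetaSeconds) {
        double skew = Math.abs(thetaSeconds);
        if (skew < SLEW_LIMIT) {
            slewClock(thetaSeconds);   // speed up / slow down by up to 500 ppm
        } else if (skew < STEP_LIMIT) {
            stepClock(thetaSeconds);   // jump straight to the estimated server time
        } else {
            // panic and do nothing: leave it to a human operator
            throw new IllegalStateException("clock skew too large: " + thetaSeconds + " s");
        }
    }

    static void slewClock(double theta) { /* placeholder */ }
    static void stepClock(double theta) { /* placeholder */ }
}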
Monotonic and time-of-day clocks

// BAD:
long startTime = System.currentTimeMillis();
doSomething();
long endTime = System.currentTimeMillis();
long elapsedMillis = endTime - startTime;
// elapsedMillis may be negative!
NTP client steps the clock during this
// GOOD:
long startTime = System.nanoTime();
doSomething();
long endTime = System.nanoTime();
long elapsedNanos = endTime - startTime;
// elapsedNanos is always >= 0
Monotonic and time-of-day clocks
Time-of-day clock:
I Time since a fixed date (e.g. 1 January 1970 epoch)
I May suddenly move forwards or backwards (NTP
stepping), subject to leap second adjustments
I Timestamps can be compared across nodes (if synced)
I Java: System.currentTimeMillis()
I Linux: clock_gettime(CLOCK_REALTIME)

Monotonic clock:
I Time since arbitrary point (e.g. when machine booted up)
I Always moves forwards at near-constant rate
I Good for measuring elapsed time on a single node
I Java: System.nanoTime()
I Linux: clock_gettime(CLOCK_MONOTONIC)
Ordering of messages

user A user B user C


m1
m1
m2 m2

m1 = “A says: The moon is made of cheese!”


m2 = “B says: Oh no it isn’t!”

C sees m2 first, m1 second,


even though logically m1 happened before m2 .
Ordering of messages using timestamps?

user A user B user C


t1 m1
m1
t2 m2
m2

m1 = (t1 , “A says: The moon is made of cheese!”)


m2 = (t2 , “B says: Oh no it isn’t!”)

Problem: even with synced clocks, t2 < t1 is possible.


Timestamp order is inconsistent with expected order!
The happens-before relation
An event is something happening at one node (sending or
receiving a message, or a local execution step).

We say event a happens before event b (written a → b) iff:


I a and b occurred at the same node, and a occurred
before b in that node’s local execution order; or
I event a is the sending of some message m, and event b is
the receipt of that same message m (assuming sent
messages are unique); or
I there exists an event c such that a → c and c → b.

The happens-before relation is a partial order: it is possible
that neither a → b nor b → a. In that case, a and b are
concurrent (written a ∥ b).
Happens-before relation example

A B C
a e
b m1
c
d m2
f

I a → b, c → d, and e → f due to node execution order


I b → c and d → f due to messages m1 and m2
I a → c, a → d, a → f , b → d, b → f , and c → f due to
transitivity
I a ∥ e, b ∥ e, c ∥ e, and d ∥ e
Causality
Taken from physics (relativity).
I When a → b, then a might have caused b.
I When a ∥ b, we know that a cannot have caused b.
Happens-before relation encodes potential causality.

[Spacetime diagram: events a, b, and c plotted against time and distance
in space; the light cones “light from a” and “light from b” bound which
later events a and b could have caused.]

Let ≺ be a strict total order on events.
If (a → b) =⇒ (a ≺ b) then ≺ is a causal order
(or: ≺ is “consistent with causality”).
NB. “causal” ≠ “casual”!
Lecture 4

Broadcast protocols and logical time


Physical timestamps inconsistent with causality

user A user B user C


t1 m1
m1
t2 m2
m2

m1 = (t1 , “A says: The moon is made of cheese!”)


m2 = (t2 , “B says: Oh no it isn’t!”)

Problem: even with synced clocks, t2 < t1 is possible.


Timestamp order is inconsistent with expected order!
Logical vs. physical clocks

I Physical clock: count number of seconds elapsed


I Logical clock: count number of events occurred

Physical timestamps: useful for many things, but may be


inconsistent with causality.

Logical clocks: designed to capture causal dependencies.

(e1 → e2 ) =⇒ (T (e1 ) < T (e2 ))

We will look at two types of logical clocks:


I Lamport clocks
I Vector clocks
Lamport clocks algorithm
on initialisation do
    t := 0                        ▷ each node has its own local variable t
end on

on any event occurring at the local node do
    t := t + 1
end on

on request to send message m do
    t := t + 1; send (t, m) via the underlying network link
end on

on receiving (t′, m) via the underlying network link do
    t := max(t, t′) + 1
    deliver m to the application
end on
Lamport clocks in words

I Each node maintains a counter t,


incremented on every local event e
I Let L(e) be the value of t after that increment
I Attach current t to messages sent over network
I Recipient moves its clock forward to timestamp in the
message (if greater than local counter), then increments

Properties of this scheme:


I If a → b then L(a) < L(b)
I However, L(a) < L(b) does not imply a → b
I Possible that L(a) = L(b) for a ≠ b
Lamport clocks example

A B C
(1, A) (1, C)
(2, A) (2, m1 )
(3, B)
(3, A) (4, B)
(4, m2 ) (5, C)

Let N (e) be the node at which event e occurred.


Then the pair (L(e), N (e)) uniquely identifies event e.

Define a total order ≺ using Lamport timestamps:

(a ≺ b) ⇐⇒ (L(a) < L(b) ∨ (L(a) = L(b) ∧ N (a) < N (b)))

This order is causal: (a → b) =⇒ (a ≺ b)
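A small Java sketch of this total order, comparing first by Lamport timestamp and breaking ties by node name (the Event record and the sample events are illustrative):

import java.util.Comparator;

// Total order on events identified by (Lamport timestamp, node name).
public class LamportOrder {
    record Event(long lamportTime, String node) {}

    static final Comparator<Event> TOTAL_ORDER =
            Comparator.comparingLong(Event::lamportTime)
                      .thenComparing(Event::node);

    public static void main(String[] args) {
        Event a = new Event(3, "A");
        Event b = new Event(3, "B");
        Event c = new Event(2, "C");
        System.out.println(TOTAL_ORDER.compare(c, a) < 0);  // true: (2,C) before (3,A)
        System.out.println(TOTAL_ORDER.compare(a, b) < 0);  // true: ties broken by node name
    }
}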


Vector clocks

Given Lamport timestamps L(a) and L(b) with L(a) < L(b)
we can’t tell whether a → b or a ∥ b.

If we want to detect which events are concurrent, we need
vector clocks:
I Assume n nodes in the system, N = ⟨N1, N2, . . . , Nn⟩
I Vector timestamp of event a is V (a) = ⟨t1, t2, . . . , tn⟩
I ti is number of events observed by node Ni
I Each node has a current vector timestamp T
I On event at node Ni, increment vector element T [i]
I Attach current vector timestamp to each message
I Recipient merges message vector into its local vector
Vector clocks algorithm
on initialisation at node Ni do
    T := ⟨0, 0, . . . , 0⟩        ▷ local variable at node Ni
end on

on any event occurring at node Ni do
    T [i] := T [i] + 1
end on

on request to send message m at node Ni do
    T [i] := T [i] + 1; send (T, m) via network
end on

on receiving (T′, m) at node Ni via the network do
    T [j] := max(T [j], T′[j]) for every j ∈ {1, . . . , n}
    T [i] := T [i] + 1; deliver m to the application
end on
Vector clocks example
Assuming the vector of nodes is N = ⟨A, B, C⟩:

A: ⟨1,0,0⟩, then ⟨2,0,0⟩ (sends m1 with timestamp ⟨2,0,0⟩), then ⟨3,0,0⟩
B: ⟨2,1,0⟩ (receives m1), then ⟨2,2,0⟩ (sends m2 with timestamp ⟨2,2,0⟩)
C: ⟨0,0,1⟩, then ⟨2,2,2⟩ (receives m2)

The vector timestamp of an event e represents a set of events,
e and its causal dependencies: {e} ∪ {a | a → e}

For example, ⟨2, 2, 0⟩ represents the first two events from A,
the first two events from B, and no events from C.
Vector clocks ordering
Define the following order on vector timestamps
(in a system with n nodes):
I T = T′ iff T [i] = T′[i] for all i ∈ {1, . . . , n}
I T ≤ T′ iff T [i] ≤ T′[i] for all i ∈ {1, . . . , n}
I T < T′ iff T ≤ T′ and T ≠ T′
I T ∥ T′ iff T ≰ T′ and T′ ≰ T

V (a) ≤ V (b) iff ({a} ∪ {e | e → a}) ⊆ ({b} ∪ {e | e → b})

Properties of this order:
I (V (a) < V (b)) ⇐⇒ (a → b)
I (V (a) = V (b)) ⇐⇒ (a = b)
I (V (a) ∥ V (b)) ⇐⇒ (a ∥ b)
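A minimal Java sketch of this ordering, with vector timestamps represented as int arrays (assuming both vectors come from the same system of n nodes):

import java.util.Arrays;

// Comparing vector timestamps: T <= T' iff it is componentwise <=.
public class VectorOrder {
    static boolean lessOrEqual(int[] t, int[] u) {
        for (int i = 0; i < t.length; i++) {
            if (t[i] > u[i]) return false;
        }
        return true;
    }

    static boolean happenedBefore(int[] t, int[] u) {   // V(a) < V(b), i.e. a -> b
        return lessOrEqual(t, u) && !Arrays.equals(t, u);
    }

    static boolean concurrent(int[] t, int[] u) {       // V(a) || V(b), i.e. a || b
        return !lessOrEqual(t, u) && !lessOrEqual(u, t);
    }

    public static void main(String[] args) {
        int[] a = {2, 2, 0}, b = {2, 2, 2}, c = {0, 0, 1};
        System.out.println(happenedBefore(a, b));  // true
        System.out.println(concurrent(a, c));      // true
    }
}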
Broadcast protocols
Broadcast (multicast) is group communication:
I One node sends message, all nodes in group deliver it
I Set of group members may be fixed (static) or dynamic
I If one node is faulty, remaining group members carry on
I Note: concept is more general than IP multicast
(we build upon point-to-point messaging)

Build upon system models from lecture 2:


I Can be best-effort (may drop messages) or
reliable (non-faulty nodes deliver every message,
by retransmitting dropped messages)
I Asynchronous/partially synchronous timing model
=⇒ no upper bound on message latency
Receiving versus delivering
Node A: Node B:

Application Application

broadcast deliver

Broadcast algorithm Broadcast algorithm


(middleware) (middleware)

send receive send receive

Network

Assume network provides point-to-point send/receive

After broadcast algorithm receives message from network, it


may buffer/queue it before delivering to the application
Forms of reliable broadcast
FIFO broadcast:
If m1 and m2 are broadcast by the same node, and
broadcast(m1 ) → broadcast(m2 ), then m1 must be delivered
before m2

Causal broadcast:
If broadcast(m1 ) → broadcast(m2 ) then m1 must be delivered
before m2

Total order broadcast:


If m1 is delivered before m2 on one node, then m1 must be
delivered before m2 on all nodes

FIFO-total order broadcast:


Combination of FIFO broadcast and total order broadcast
FIFO broadcast
A B C
m1 m1
m1
m2 m2 m2

m3 m3
m3

Messages sent by the same node must be delivered in the


order they were sent.
Messages sent by different nodes can be delivered in any order.
Valid orders: (m2 , m1 , m3 ) or (m1 , m2 , m3 ) or (m1 , m3 , m2 )
Causal broadcast
A B C
m1 m1
m1

m2 m2
m3 m3

m3

Causally related messages must be delivered in causal order.


Concurrent messages can be delivered in any order.
Here: broadcast(m1 ) → broadcast(m2 ) and
broadcast(m1 ) → broadcast(m3 )
=⇒ valid orders are: (m1 , m2 , m3 ) or (m1 , m3 , m2 )
Total order broadcast (1)

A B C
m1 m1
m1
Total order broadcast (1)

A B C
m1 m1
m1

m2 m2
Total order broadcast (1)

A B C
m1 m1
m1
m3 m2 m2
m3

m3
Total order broadcast (1)

A B C
m1 m1
m1
m3 m2 m2
m3

m3

All nodes must deliver messages in the same order


(here: m1 , m2 , m3 )
Total order broadcast (1)

A B C
m1 m1
m1
m3 m2 m2
m3

m3

All nodes must deliver messages in the same order


(here: m1 , m2 , m3 )

This includes a node’s deliveries to itself!


Total order broadcast (2)
A B C
m1 m1
m1

m2 m2
m3 m3
m3
m2

All nodes must deliver messages in the same order


(here: m1 , m3 , m2 )

This includes a node’s deliveries to itself!


Relationships between broadcast models

(Hierarchy figure, where an arrow means “stronger than”:)
FIFO-total order broadcast is stronger than both causal broadcast and
total order broadcast; causal broadcast is stronger than FIFO broadcast;
FIFO broadcast and total order broadcast are both stronger than reliable
broadcast, which in turn is stronger than best-effort broadcast.
Broadcast algorithms
Break down into two layers:
1. Make best-effort broadcast reliable by retransmitting
dropped messages
2. Enforce delivery order on top of reliable broadcast
Broadcast algorithms
Break down into two layers:
1. Make best-effort broadcast reliable by retransmitting
dropped messages
2. Enforce delivery order on top of reliable broadcast

First attempt: broadcasting node sends message directly


to every other node
I Use reliable links (retry + deduplicate)
Broadcast algorithms
Break down into two layers:
1. Make best-effort broadcast reliable by retransmitting
dropped messages
2. Enforce delivery order on top of reliable broadcast

First attempt: broadcasting node sends message directly


to every other node
I Use reliable links (retry + deduplicate)
I Problem: node may crash before all messages delivered

A B C
m1
m1
Eager reliable broadcast
Idea: the first time a node receives a particular message, it
re-broadcasts to each other node (via reliable links).

A B C
Eager reliable broadcast
Idea: the first time a node receives a particular message, it
re-broadcasts to each other node (via reliable links).

A B C
m1
m1
Eager reliable broadcast
Idea: the first time a node receives a particular message, it
re-broadcasts to each other node (via reliable links).

A B C
m1
m1
m1 m1
Eager reliable broadcast
Idea: the first time a node receives a particular message, it
re-broadcasts to each other node (via reliable links).

A B C
m1
m1
m1 m1
m1
m1
Eager reliable broadcast
Idea: the first time a node receives a particular message, it
re-broadcasts to each other node (via reliable links).

A B C
m1
m1
m1 m1
m1
m1

Reliable, but. . . up to O(n²) messages for n nodes!
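For illustration (not part of the lecture notes), a minimal Python sketch of eager reliable broadcast; send_to_all and deliver are assumed to be provided by the point-to-point network layer and the application respectively:

class EagerReliableBroadcast:
    def __init__(self, send_to_all, deliver):
        self.send_to_all = send_to_all   # assumed: sends a message over reliable links to every other node
        self.deliver = deliver           # assumed: hands a message to the application
        self.seen = set()                # IDs of messages already delivered at this node

    def broadcast(self, msg_id, payload):
        self.on_receive(msg_id, payload)  # treat our own message like any received message

    def on_receive(self, msg_id, payload):
        if msg_id in self.seen:
            return                        # duplicate: already delivered and re-broadcast
        self.seen.add(msg_id)
        self.deliver(payload)
        # first sighting: re-broadcast to all other nodes, which is what
        # leads to up to O(n²) messages for n nodes
        self.send_to_all((msg_id, payload))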


Gossip protocols
Useful when broadcasting to a large number of nodes.
Idea: when a node receives a message for the first time,
forward it to 3 other nodes, chosen randomly.
Gossip protocols
Useful when broadcasting to a large number of nodes.
Idea: when a node receives a message for the first time,
forward it to 3 other nodes, chosen randomly.
Gossip protocols
Useful when broadcasting to a large number of nodes.
Idea: when a node receives a message for the first time,
forward it to 3 other nodes, chosen randomly.
Gossip protocols
Useful when broadcasting to a large number of nodes.
Idea: when a node receives a message for the first time,
forward it to 3 other nodes, chosen randomly.
Gossip protocols
Useful when broadcasting to a large number of nodes.
Idea: when a node receives a message for the first time,
forward it to 3 other nodes, chosen randomly.
Gossip protocols
Useful when broadcasting to a large number of nodes.
Idea: when a node receives a message for the first time,
forward it to 3 other nodes, chosen randomly.
Gossip protocols
Useful when broadcasting to a large number of nodes.
Idea: when a node receives a message for the first time,
forward it to 3 other nodes, chosen randomly.

Eventually reaches all nodes (with high probability).
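A minimal sketch of the gossip idea in Python (illustrative only; the fan-out of 3 and the peers/send/deliver parameters are assumptions):

import random

FANOUT = 3  # forward each newly seen message to 3 randomly chosen peers

class GossipNode:
    def __init__(self, peers, send, deliver):
        self.peers = peers        # assumed: list of other nodes' identifiers
        self.send = send          # assumed: send(destination, message) over the network
        self.deliver = deliver    # assumed: hands the payload to the application
        self.seen = set()         # message IDs seen so far

    def on_receive(self, msg_id, payload):
        if msg_id in self.seen:
            return                # already forwarded this message: ignore it
        self.seen.add(msg_id)
        self.deliver(payload)
        for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
            self.send(peer, (msg_id, payload))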


FIFO broadcast algorithm

on initialisation do
    sendSeq := 0; delivered := ⟨0, 0, . . . , 0⟩; buffer := {}
end on

on request to broadcast m at node Ni do
    send (i, sendSeq, m) via reliable broadcast
    sendSeq := sendSeq + 1
end on

on receiving msg from reliable broadcast at node Ni do
    buffer := buffer ∪ {msg}
    while ∃sender, m. (sender, delivered[sender], m) ∈ buffer do
        deliver m to the application
        delivered[sender] := delivered[sender] + 1
    end while
end on
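A direct Python transcription of the FIFO broadcast pseudocode above, as a sketch (the underlying reliable broadcast is assumed to call on_rb_deliver; unlike the pseudocode, delivered messages are also removed from the buffer):

class FifoBroadcast:
    def __init__(self, node_index, num_nodes, rb_broadcast, deliver):
        self.i = node_index               # index i of this node
        self.send_seq = 0                 # sendSeq
        self.delivered = [0] * num_nodes  # delivered[sender]: next sequence number expected
        self.buffer = set()               # received but not yet delivered
        self.rb_broadcast = rb_broadcast  # assumed: underlying reliable broadcast
        self.deliver = deliver            # assumed: delivers a message to the application

    def broadcast(self, m):
        self.rb_broadcast((self.i, self.send_seq, m))
        self.send_seq += 1

    def on_rb_deliver(self, msg):         # msg = (sender, seq, m); m assumed hashable
        self.buffer.add(msg)
        progress = True
        while progress:
            progress = False
            for (sender, seq, m) in list(self.buffer):
                if seq == self.delivered[sender]:
                    self.deliver(m)
                    self.delivered[sender] += 1
                    self.buffer.discard((sender, seq, m))
                    progress = True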
Causal broadcast algorithm
on initialisation do
    sendSeq := 0; delivered := ⟨0, 0, . . . , 0⟩; buffer := {}
end on

on request to broadcast m at node Ni do
    deps := delivered; deps[i] := sendSeq
    send (i, deps, m) via reliable broadcast
    sendSeq := sendSeq + 1
end on

on receiving msg from reliable broadcast at node Ni do
    buffer := buffer ∪ {msg}
    while ∃(sender, deps, m) ∈ buffer. deps ≤ delivered do
        deliver m to the application
        buffer := buffer \ {(sender, deps, m)}
        delivered[sender] := delivered[sender] + 1
    end while
end on
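The corresponding Python sketch of causal broadcast (same assumptions as the FIFO sketch above; the dependency vector is stored as a tuple so that buffered messages stay hashable):

class CausalBroadcast:
    def __init__(self, node_index, num_nodes, rb_broadcast, deliver):
        self.i = node_index
        self.send_seq = 0
        self.delivered = [0] * num_nodes
        self.buffer = set()
        self.rb_broadcast = rb_broadcast  # assumed: underlying reliable broadcast
        self.deliver = deliver            # assumed: delivers a message to the application

    def broadcast(self, m):
        deps = list(self.delivered)       # messages this node has delivered so far...
        deps[self.i] = self.send_seq      # ...plus its own earlier broadcasts
        self.rb_broadcast((self.i, tuple(deps), m))
        self.send_seq += 1

    def on_rb_deliver(self, msg):         # msg = (sender, deps, m)
        self.buffer.add(msg)
        progress = True
        while progress:
            progress = False
            for (sender, deps, m) in list(self.buffer):
                if all(d <= s for d, s in zip(deps, self.delivered)):
                    self.deliver(m)
                    self.buffer.discard((sender, deps, m))
                    self.delivered[sender] += 1
                    progress = True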
Vector clocks ordering
Define the following order on vector timestamps
(in a system with n nodes):
I T = T′ iff T[i] = T′[i] for all i ∈ {1, . . . , n}
I T ≤ T′ iff T[i] ≤ T′[i] for all i ∈ {1, . . . , n}
I T < T′ iff T ≤ T′ and T ≠ T′
I T ∥ T′ iff T ≰ T′ and T′ ≰ T
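These comparisons can be written down directly; a small illustrative Python sketch, with vector timestamps represented as equal-length tuples:

def vc_leq(t1, t2):
    # T ≤ T′ iff every component of T is ≤ the corresponding component of T′
    return all(a <= b for a, b in zip(t1, t2))

def vc_compare(t1, t2):
    # returns "=", "<", ">", or "||" (concurrent)
    if t1 == t2:
        return "="
    if vc_leq(t1, t2):
        return "<"
    if vc_leq(t2, t1):
        return ">"
    return "||"

assert vc_compare((2, 2, 0), (3, 2, 0)) == "<"
assert vc_compare((1, 2, 0), (0, 3, 0)) == "||"   # neither ≤ the other: concurrent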
Total order broadcast algorithms
Single leader approach:
I One node is designated as leader (sequencer)
I To broadcast message, send it to the leader;
leader broadcasts it via FIFO broadcast.
Total order broadcast algorithms
Single leader approach:
I One node is designated as leader (sequencer)
I To broadcast message, send it to the leader;
leader broadcasts it via FIFO broadcast.
I Problem: leader crashes =⇒ no more messages delivered
I Changing the leader safely is difficult
Total order broadcast algorithms
Single leader approach:
I One node is designated as leader (sequencer)
I To broadcast message, send it to the leader;
leader broadcasts it via FIFO broadcast.
I Problem: leader crashes =⇒ no more messages delivered
I Changing the leader safely is difficult

Lamport clocks approach:


I Attach Lamport timestamp to every message
I Deliver messages in total order of timestamps
Total order broadcast algorithms
Single leader approach:
I One node is designated as leader (sequencer)
I To broadcast message, send it to the leader;
leader broadcasts it via FIFO broadcast.
I Problem: leader crashes =⇒ no more messages delivered
I Changing the leader safely is difficult

Lamport clocks approach:


I Attach Lamport timestamp to every message
I Deliver messages in total order of timestamps
I Problem: how do you know if you have seen all messages
with timestamp < T ? Need to use FIFO links and wait
for message with timestamp ≥ T from every node
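A sketch of that delivery condition in Python (assumptions: each queued message is a (timestamp, sender, payload) triple, ties are broken by sender ID, FIFO links are used, and max_ts_seen maps every node to the highest timestamp received from it):

def deliverable(msg, queue, max_ts_seen):
    t, sender, _ = msg
    # msg must have the smallest (timestamp, sender) pair currently queued...
    if (t, sender) != min((t2, s2) for (t2, s2, _) in queue):
        return False
    # ...and every node must have sent something with timestamp >= t, so that
    # (thanks to FIFO links) no message with a smaller timestamp can still arrive
    return all(ts >= t for ts in max_ts_seen.values())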
Lecture 5

Replication
Replication
I Keeping a copy of the same data on multiple nodes
I Databases, filesystems, caches, . . .
I A node that has a copy of the data is called a replica
Replication
I Keeping a copy of the same data on multiple nodes
I Databases, filesystems, caches, . . .
I A node that has a copy of the data is called a replica
I If some replicas are faulty, others are still accessible
I Spread load across many replicas
Replication
I Keeping a copy of the same data on multiple nodes
I Databases, filesystems, caches, . . .
I A node that has a copy of the data is called a replica
I If some replicas are faulty, others are still accessible
I Spread load across many replicas
I Easy if the data doesn’t change: just copy it
I We will focus on data changes
Replication
I Keeping a copy of the same data on multiple nodes
I Databases, filesystems, caches, . . .
I A node that has a copy of the data is called a replica
I If some replicas are faulty, others are still accessible
I Spread load across many replicas
I Easy if the data doesn’t change: just copy it
I We will focus on data changes

Compare to RAID (Redundant Array of Independent Disks):


replication within a single computer
I RAID has single controller; in distributed system, each
node acts independently
I Replicas can be distributed around the world, near users
Retrying state updates
User A: The moon is not actually made of cheese!
Like 12,300 people like this.

client
Retrying state updates
User A: The moon is not actually made of cheese!
Like 12,300 people like this.

client
increment post.likes
ack 12,301
Retrying state updates
User A: The moon is not actually made of cheese!
Like 12,300 people like this.

client
increment post.likes
ack 12,301

increment post.likes
ack 12,302
Retrying state updates
User A: The moon is not actually made of cheese!
Like 12,300 people like this.

client
increment post.likes
ack 12,301

increment post.likes
ack 12,302

Deduplicating requests requires that the database tracks which


requests it has already seen (in stable storage)
Idempotence

A function f is idempotent if f (x) = f (f (x)).


I Not idempotent: f (likeCount) = likeCount + 1
I Idempotent: f (likeSet) = likeSet ∪ {userID}
Idempotent requests can be retried without deduplication.
Idempotence

A function f is idempotent if f (x) = f (f (x)).


I Not idempotent: f (likeCount) = likeCount + 1
I Idempotent: f (likeSet) = likeSet ∪ {userID}
Idempotent requests can be retried without deduplication.

Choice of retry behaviour:


I At-most-once semantics:
send request, don’t retry, update may not happen
I At-least-once semantics:
retry request until acknowledged, may repeat update
I Exactly-once semantics:
retry + idempotence or deduplication
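The two update styles from the slide, written out in Python for illustration:

def increment_likes(like_count):
    return like_count + 1            # not idempotent: f(f(x)) = x + 2 ≠ f(x)

def add_like(like_set, user_id):
    return like_set | {user_id}      # idempotent: applying it twice changes nothing more

assert add_like(add_like(set(), "alice"), "alice") == add_like(set(), "alice")
assert increment_likes(increment_likes(0)) != increment_likes(0)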
Adding and then removing again

client 1 client 2
Adding and then removing again

client 1 client 2
f : add like

ack

f (likes) = likes ∪ {userID}


Adding and then removing again

client 1 client 2
f : add like

ack
set of likes

f (likes) = likes ∪ {userID}


Adding and then removing again

client 1 client 2
f : add like

ack
set of likes
g : unlike
ack

f (likes) = likes ∪ {userID}


g(likes) = likes \ {userID}
Adding and then removing again

client 1 client 2
f : add like

ack
set of likes
g : unlike
f : add like ack
ack

f (likes) = likes ∪ {userID}


g(likes) = likes \ {userID}
Adding and then removing again

client 1 client 2
f : add like

ack
set of likes
g : unlike
f : add like ack
ack

f (likes) = likes ∪ {userID}


g(likes) = likes \ {userID}
Idempotent? f(f(x)) = f(x), but f(g(f(x))) ≠ g(f(x))
Another problem with adding and removing
client A B
Another problem with adding and removing
client A B
add(x)
add(x)
Another problem with adding and removing
client A B
add(x)
add(x)
remove(x)
remove(x)
Another problem with adding and removing
client A B
add(x)
add(x)
remove(x)
remove(x)

Final state (x ∉ A, x ∈ B) is the same as in this case:
Another problem with adding and removing
client A B
add(x)
add(x)
remove(x)
remove(x)

Final state (x ∉ A, x ∈ B) is the same as in this case:

client A B
add(x)
add(x)
Timestamps and tombstones

client A B
client → A, B: (t1, add(x))
A: {x ↦ (t1, true)}    B: {x ↦ (t1, true)}
Timestamps and tombstones

client A B
client → A, B: (t1, add(x))
A: {x ↦ (t1, true)}    B: {x ↦ (t1, true)}
client → A, B: (t2, remove(x))
A: {x ↦ (t2, false)}
Timestamps and tombstones

client A B
client → A, B: (t1, add(x))
A: {x ↦ (t1, true)}    B: {x ↦ (t1, true)}
client → A, B: (t2, remove(x))
A: {x ↦ (t2, false)}

“remove(x)” doesn’t actually remove x: it labels x with


“false” to indicate it is invisible (a tombstone)
Timestamps and tombstones

client A B
client → A, B: (t1, add(x))
A: {x ↦ (t1, true)}    B: {x ↦ (t1, true)}
client → A, B: (t2, remove(x))
A: {x ↦ (t2, false)}

“remove(x)” doesn’t actually remove x: it labels x with


“false” to indicate it is invisible (a tombstone)
Every record has logical timestamp of last write
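A sketch of one replica's state under this scheme (illustrative Python; the state layout and function names are assumptions): each key maps to the logical timestamp of its last write and a visibility flag.

state = {}   # key -> (timestamp_of_last_write, visible?)

def apply_add(key, timestamp):
    prev = state.get(key)
    if prev is None or prev[0] < timestamp:
        state[key] = (timestamp, True)

def apply_remove(key, timestamp):
    prev = state.get(key)
    if prev is None or prev[0] < timestamp:
        state[key] = (timestamp, False)   # tombstone: the record stays, marked invisible

def visible_keys():
    return {k for k, (_, visible) in state.items() if visible}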
Reconciling replicas
Replicas periodically communicate among themselves
to check for any inconsistencies.

A B
A: {x ↦ (t2, false)}    B: {x ↦ (t1, true)}


Reconciling replicas
Replicas periodically communicate among themselves
to check for any inconsistencies.

A B
A: {x ↦ (t2, false)}    B: {x ↦ (t1, true)}
reconcile state (anti-entropy)
Reconciling replicas
Replicas periodically communicate among themselves
to check for any inconsistencies.

A B
A: {x ↦ (t2, false)}    B: {x ↦ (t1, true)}
reconcile state (anti-entropy)
since t1 < t2, afterwards: A: {x ↦ (t2, false)}    B: {x ↦ (t2, false)}


Reconciling replicas
Replicas periodically communicate among themselves
to check for any inconsistencies.

A B
A: {x ↦ (t2, false)}    B: {x ↦ (t1, true)}
reconcile state (anti-entropy)
since t1 < t2, afterwards: A: {x ↦ (t2, false)}    B: {x ↦ (t2, false)}

Propagate the record with the latest timestamp,


discard the records with earlier timestamps
(for a given key).
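An anti-entropy sketch using the same (timestamp, visible) representation as the previous sketch (illustrative; timestamps are assumed to be totally ordered here):

def reconcile(state_a, state_b):
    merged = dict(state_a)
    for key, (ts, visible) in state_b.items():
        if key not in merged or merged[key][0] < ts:
            merged[key] = (ts, visible)   # keep the record with the later timestamp
    return merged

# after both replicas adopt reconcile(A, B), they hold the same records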
Concurrent writes by different clients

client 1 A B client 2
Concurrent writes by different clients

client 1 A B client 2
client 1 → A: (t1, set(x, v1))
Concurrent writes by different clients

client 1 A B client 2
client 1 → A: (t1, set(x, v1));    client 2 → B: (t2, set(x, v2))
Concurrent writes by different clients

client 1 A B client 2
client 1 → A: (t1, set(x, v1));    client 2 → B: (t2, set(x, v2))

Two common approaches:


I Last writer wins (LWW):
Use timestamps with total order (e.g. Lamport clock)
Keep v2 and discard v1 if t2 > t1 . Note: data loss!
Concurrent writes by different clients

client 1 A B client 2
client 1 → A: (t1, set(x, v1));    client 2 → B: (t2, set(x, v2))

Two common approaches:


I Last writer wins (LWW):
Use timestamps with total order (e.g. Lamport clock)
Keep v2 and discard v1 if t2 > t1 . Note: data loss!
I Multi-value register:
Use timestamps with partial order (e.g. vector clock)
v2 replaces v1 if t2 > t1 ; preserve both {v1 , v2 } if t1 k t2
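Both merge rules are easy to sketch in Python (illustrative; the multi-value register uses the vc_leq helper from the vector clock sketch earlier):

def lww_merge(a, b):
    # a, b are (timestamp, value) pairs with totally ordered timestamps
    return a if a[0] > b[0] else b        # the losing write is silently discarded (data loss)

def mvr_merge(writes):
    # writes is a list of (vector_timestamp, value) pairs; keep every write
    # that is not dominated by (i.e. did not happen before) another write
    return [(t, v) for (t, v) in writes
            if not any(vc_leq(t, t2) and t != t2 for (t2, _) in writes)]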
Probability of faults
A replica may be unavailable due to network partition or
node fault (e.g. crash, hardware problem).
Probability of faults
A replica may be unavailable due to network partition or
node fault (e.g. crash, hardware problem).

Assume each replica has probability p of being faulty or


unavailable at any one time, and that faults are independent.
(Not actually true! But okay approximation for now.)
Probability of faults
A replica may be unavailable due to network partition or
node fault (e.g. crash, hardware problem).

Assume each replica has probability p of being faulty or


unavailable at any one time, and that faults are independent.
(Not actually true! But okay approximation for now.)

Probability of all n replicas being faulty: p^n

Probability of ≥ 1 out of n replicas being faulty: 1 − (1 − p)^n
Probability of faults
A replica may be unavailable due to network partition or
node fault (e.g. crash, hardware problem).

Assume each replica has probability p of being faulty or


unavailable at any one time, and that faults are independent.
(Not actually true! But okay approximation for now.)

Probability of all n replicas being faulty: p^n

Probability of ≥ 1 out of n replicas being faulty: 1 − (1 − p)^n

Example with p = 0.01:

replicas n    P(≥ 1 faulty)    P(≥ (n+1)/2 faulty)    P(all n faulty)
1             0.01             0.01                   0.01
3             0.03             3 · 10^−4              10^−6
5             0.049            1 · 10^−5              10^−10
100           0.63             6 · 10^−74             10^−200
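The table's figures can be recomputed with the binomial distribution; a quick Python check (assuming independent faults with p = 0.01):

from math import comb

def p_at_least(k, n, p=0.01):
    # probability that at least k out of n independent replicas are faulty
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

for n in (1, 3, 5, 100):
    print(n, p_at_least(1, n), p_at_least((n + 1) // 2, n), p_at_least(n, n))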
Read-after-write consistency

client A B
(t1 , set(x, v ))
t1 1
Read-after-write consistency

client A B
(t1 , set(x, v ))
t1 1

get(x)

(t0, v0)
Read-after-write consistency

client A B
(t1 , set(x, v ))
t1 1

get(x)

(t0, v0)

Writing to one replica, reading from another: client does not


read back the value it has written
Read-after-write consistency

client A B
(t1 , set(x, v ))
t1 1

get(x)

(t0, v0)

Writing to one replica, reading from another: client does not


read back the value it has written

Require writing to/reading from both replicas =⇒ cannot


write/read if one replica is unavailable
Quorum (2 out of 3)
client A B C
Quorum (2 out of 3)
client A B C
(t1 , set(x, v ))
t1 1
Quorum (2 out of 3)
client A B C
(t1 , set(x, v ))
t1 1

ok ok

Write succeeds on B and C


Quorum (2 out of 3)
client A B C
(t1 , set(x, v ))
t1 1

ok ok

get(x)

Write succeeds on B and C


Quorum (2 out of 3)
client A B C
(t1 , set(x, v ))
t1 1

ok ok

get(x)

(t0, v0) (t1, v1)

Write succeeds on B and C; read succeeds on A and B


Quorum (2 out of 3)
client A B C
(t1 , set(x, v ))
t1 1

ok ok

get(x)

(t0, v0) (t1, v1)

Write succeeds on B and C; read succeeds on A and B


Choose between (t0 , v0 ) and (t1 , v1 ) based on timestamp
Read and write quorums
In a system with n replicas:
I If a write is acknowledged by w replicas (write quorum),
Read and write quorums
In a system with n replicas:
I If a write is acknowledged by w replicas (write quorum),
I and we subsequently read from r replicas (read quorum),
I and r + w > n,
Read and write quorums
In a system with n replicas:
I If a write is acknowledged by w replicas (write quorum),
I and we subsequently read from r replicas (read quorum),
I and r + w > n,
I . . . then the read will see the previously written value
(or a value that subsequently overwrote it)
Read and write quorums
In a system with n replicas:
I If a write is acknowledged by w replicas (write quorum),
I and we subsequently read from r replicas (read quorum),
I and r + w > n,
I . . . then the read will see the previously written value
(or a value that subsequently overwrote it)
I Read quorum and write quorum share ≥ 1 replica

A B C D E

read quorum write quorum


Read and write quorums
In a system with n replicas:
I If a write is acknowledged by w replicas (write quorum),
I and we subsequently read from r replicas (read quorum),
I and r + w > n,
I . . . then the read will see the previously written value
(or a value that subsequently overwrote it)
I Read quorum and write quorum share ≥ 1 replica
I Typical: r = w = (n + 1)/2 for n = 3, 5, 7, . . . (majority)
I Reads can tolerate n − r unavailable replicas, writes n − w

A B C D E

read quorum write quorum
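The quorum arithmetic as a small illustrative Python check:

def quorums_intersect(n, r, w):
    # r + w > n guarantees that any read quorum and write quorum share >= 1 replica
    return r + w > n

n = 5
r = w = (n + 1) // 2                 # majority quorums: r = w = 3
assert quorums_intersect(n, r, w)
print("reads tolerate", n - r, "unavailable replicas; writes tolerate", n - w)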


Read repair

client A B C
Read repair

client A B C
get(x)
Read repair

client A B C
get(x)

(t0, v0) (t1, v1)

Update (t1 , v1 ) is more recent than (t0 , v0 ) since t0 < t1 .


Read repair

client A B C
get(x)

(t0, v0) (t1, v1)

(t1 , set(x, v ))
1

Update (t1 , v1 ) is more recent than (t0 , v0 ) since t0 < t1 .


Client helps propagate (t1 , v1 ) to other replicas.
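A sketch of the read-repair step in Python (illustrative; responses maps each responding replica to its (timestamp, value) record, timestamps are assumed comparable, and send is an assumed network function):

def quorum_read_with_repair(responses, send):
    latest_ts, latest_val = max(responses.values())   # newest record by timestamp
    for replica, (ts, _) in responses.items():
        if ts < latest_ts:
            # write the newest record back to any replica that returned a stale one
            send(replica, ("set", latest_ts, latest_val))
    return latest_val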
State machine replication
So far we have used best-effort broadcast for replication.
What about stronger broadcast models?
State machine replication
So far we have used best-effort broadcast for replication.
What about stronger broadcast models?

Total order broadcast: every node delivers the same


messages in the same order
State machine replication
So far we have used best-effort broadcast for replication.
What about stronger broadcast models?

Total order broadcast: every node delivers the same


messages in the same order

State machine replication (SMR):


I FIFO-total order broadcast every update to all replicas
I Replica delivers update message: apply it to own state
State machine replication
So far we have used best-effort broadcast for replication.
What about stronger broadcast models?

Total order broadcast: every node delivers the same


messages in the same order

State machine replication (SMR):


I FIFO-total order broadcast every update to all replicas
I Replica delivers update message: apply it to own state
I Applying an update is deterministic
State machine replication
So far we have used best-effort broadcast for replication.
What about stronger broadcast models?

Total order broadcast: every node delivers the same


messages in the same order

State machine replication (SMR):


I FIFO-total order broadcast every update to all replicas
I Replica delivers update message: apply it to own state
I Applying an update is deterministic
I Replica is a state machine: starts in fixed initial state,
goes through same sequence of state transitions in the
same order =⇒ all replicas end up in the same state
State machine replication
on request to perform update u do
send u via FIFO-total order broadcast
end on

on delivering u through FIFO-total order broadcast do


update state using arbitrary deterministic logic!
end on
State machine replication
on request to perform update u do
send u via FIFO-total order broadcast
end on

on delivering u through FIFO-total order broadcast do


update state using arbitrary deterministic logic!
end on
Closely related ideas:
I Serializable transactions (execute in delivery order)
State machine replication
on request to perform update u do
send u via FIFO-total order broadcast
end on

on delivering u through FIFO-total order broadcast do


update state using arbitrary deterministic logic!
end on
Closely related ideas:
I Serializable transactions (execute in delivery order)
I Blockchains, distributed ledgers, smart contracts
State machine replication
on request to perform update u do
send u via FIFO-total order broadcast
end on

on delivering u through FIFO-total order broadcast do


update state using arbitrary deterministic logic!
end on
Closely related ideas:
I Serializable transactions (execute in delivery order)
I Blockchains, distributed ledgers, smart contracts
Limitations:
I Cannot update state immediately, have to wait for
delivery through broadcast
State machine replication
on request to perform update u do
send u via FIFO-total order broadcast
end on

on delivering u through FIFO-total order broadcast do


update state using arbitrary deterministic logic!
end on
Closely related ideas:
I Serializable transactions (execute in delivery order)
I Blockchains, distributed ledgers, smart contracts
Limitations:
I Cannot update state immediately, have to wait for
delivery through broadcast
I Need fault-tolerant total order broadcast: see lecture 6
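A minimal state machine replication sketch in Python (illustrative; the FIFO-total order broadcast object and its callback wiring are assumptions):

class ReplicatedCounter:
    def __init__(self, fifo_total_order_broadcast):
        self.state = 0
        self.broadcast = fifo_total_order_broadcast   # assumed: has a send(update) method

    def request_increment(self, amount):
        # do not touch self.state here: wait until the update is delivered,
        # so that all replicas apply updates in the same order
        self.broadcast.send(("increment", amount))

    def on_deliver(self, update):
        op, amount = update
        if op == "increment":
            self.state += amount    # deterministic, so every replica ends in the same state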
Database leader replica
Leader database replica L ensures total order broadcast

client 1 client 2 L F
Database leader replica
Leader database replica L ensures total order broadcast

client 1 client 2 L F
T1

T1
Database leader replica
Leader database replica L ensures total order broadcast

client 1 client 2 L F
T1

T2 T1

T2
Database leader replica
Leader database replica L ensures total order broadcast

client 1 client 2 L F
T1

T2 T1

T2
ok commit

Follower F applies transaction log in commit order


Database leader replica
Leader database replica L ensures total order broadcast

client 1 client 2 L F
T1

T2 T1

T2
ok commit

ok commit

Follower F applies transaction log in commit order


Replication using causal (and weaker) broadcast
State machine replication uses (FIFO-)total order broadcast.
Can we use weaker forms of broadcast too?
Replication using causal (and weaker) broadcast
State machine replication uses (FIFO-)total order broadcast.
Can we use weaker forms of broadcast too?

If replica state updates are commutative, replicas can process


updates in different orders and still end up in the same state.

Updates f and g are commutative if f (g(x)) = g(f (x))


Replication using causal (and weaker) broadcast
State machine replication uses (FIFO-)total order broadcast.
Can we use weaker forms of broadcast too?

If replica state updates are commutative, replicas can process


updates in different orders and still end up in the same state.

Updates f and g are commutative if f (g(x)) = g(f (x))

broadcast assumptions about state update function


total order deterministic (SMR)
Replication using causal (and weaker) broadcast
State machine replication uses (FIFO-)total order broadcast.
Can we use weaker forms of broadcast too?

If replica state updates are commutative, replicas can process


updates in different orders and still end up in the same state.

Updates f and g are commutative if f (g(x)) = g(f (x))

broadcast assumptions about state update function


total order deterministic (SMR)
causal deterministic, concurrent updates commute
Replication using causal (and weaker) broadcast
State machine replication uses (FIFO-)total order broadcast.
Can we use weaker forms of broadcast too?

If replica state updates are commutative, replicas can process


updates in different orders and still end up in the same state.

Updates f and g are commutative if f (g(x)) = g(f (x))

broadcast assumptions about state update function


total order deterministic (SMR)
causal deterministic, concurrent updates commute
reliable deterministic, all updates commute
Replication using causal (and weaker) broadcast
State machine replication uses (FIFO-)total order broadcast.
Can we use weaker forms of broadcast too?

If replica state updates are commutative, replicas can process


updates in different orders and still end up in the same state.

Updates f and g are commutative if f (g(x)) = g(f (x))

broadcast assumptions about state update function


total order deterministic (SMR)
causal deterministic, concurrent updates commute
reliable deterministic, all updates commute
best-effort deterministic, commutative, idempotent,
tolerates message loss
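A tiny illustration of commutativity in Python: set-union updates commute, so two replicas that apply them in opposite orders still converge.

def f(s): return s | {"x"}
def g(s): return s | {"y"}

assert f(g(set())) == g(f(set()))   # order of application does not matter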
Lecture 6

Consensus
Fault-tolerant total order broadcast
Total order broadcast is very useful for state machine
replication.
Can implement total order broadcast by sending all messages
via a single leader.
Problem: what if leader crashes/becomes unavailable?
Fault-tolerant total order broadcast
Total order broadcast is very useful for state machine
replication.
Can implement total order broadcast by sending all messages
via a single leader.
Problem: what if leader crashes/becomes unavailable?

I Manual failover: a human operator chooses a new


leader, and reconfigures each node to use new leader
Used in many databases! Fine for planned maintenance.
Fault-tolerant total order broadcast
Total order broadcast is very useful for state machine
replication.
Can implement total order broadcast by sending all messages
via a single leader.
Problem: what if leader crashes/becomes unavailable?

I Manual failover: a human operator chooses a new


leader, and reconfigures each node to use new leader
Used in many databases! Fine for planned maintenance.
Unplanned outage? Humans are slow, may take a long
time until system recovers. . .
Fault-tolerant total order broadcast
Total order broadcast is very useful for state machine
replication.
Can implement total order broadcast by sending all messages
via a single leader.
Problem: what if leader crashes/becomes unavailable?

I Manual failover: a human operator chooses a new


leader, and reconfigures each node to use new leader
Used in many databases! Fine for planned maintenance.
Unplanned outage? Humans are slow, may take a long
time until system recovers. . .
I Can we automatically choose a new leader?
Consensus and total order broadcast
I Traditional formulation of consensus: several nodes want
to come to agreement about a single value
Consensus and total order broadcast
I Traditional formulation of consensus: several nodes want
to come to agreement about a single value
I In context of total order broadcast: this value is the next
message to deliver
Consensus and total order broadcast
I Traditional formulation of consensus: several nodes want
to come to agreement about a single value
I In context of total order broadcast: this value is the next
message to deliver
I Once one node decides on a certain message order, all
nodes will decide the same order
Consensus and total order broadcast
I Traditional formulation of consensus: several nodes want
to come to agreement about a single value
I In context of total order broadcast: this value is the next
message to deliver
I Once one node decides on a certain message order, all
nodes will decide the same order
I Consensus and total order broadcast are formally
equivalent
Consensus and total order broadcast
I Traditional formulation of consensus: several nodes want
to come to agreement about a single value
I In context of total order broadcast: this value is the next
message to deliver
I Once one node decides on a certain message order, all
nodes will decide the same order
I Consensus and total order broadcast are formally
equivalent
Common consensus algorithms:
I Paxos: single-value consensus
Multi-Paxos: generalisation to total order broadcast
Consensus and total order broadcast
I Traditional formulation of consensus: several nodes want
to come to agreement about a single value
I In context of total order broadcast: this value is the next
message to deliver
I Once one node decides on a certain message order, all
nodes will decide the same order
I Consensus and total order broadcast are formally
equivalent
Common consensus algorithms:
I Paxos: single-value consensus
Multi-Paxos: generalisation to total order broadcast
I Raft, Viewstamped Replication, Zab:
FIFO-total order broadcast by default
Consensus system models
Paxos, Raft, etc. assume a partially synchronous,
crash-recovery system model.
Consensus system models
Paxos, Raft, etc. assume a partially synchronous,
crash-recovery system model.

Why not asynchronous?


I FLP result (Fischer, Lynch, Paterson):
There is no deterministic consensus algorithm that is
guaranteed to terminate in an asynchronous crash-stop
system model.
Consensus system models
Paxos, Raft, etc. assume a partially synchronous,
crash-recovery system model.

Why not asynchronous?


I FLP result (Fischer, Lynch, Paterson):
There is no deterministic consensus algorithm that is
guaranteed to terminate in an asynchronous crash-stop
system model.
I Paxos, Raft, etc. use clocks only used for timeouts/failure
detector to ensure progress. Safety (correctness) does not
depend on timing.
Consensus system models
Paxos, Raft, etc. assume a partially synchronous,
crash-recovery system model.

Why not asynchronous?


I FLP result (Fischer, Lynch, Paterson):
There is no deterministic consensus algorithm that is
guaranteed to terminate in an asynchronous crash-stop
system model.
I Paxos, Raft, etc. use clocks only used for timeouts/failure
detector to ensure progress. Safety (correctness) does not
depend on timing.

There are also consensus algorithms for a partially synchronous


Byzantine system model (used in blockchains)
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
I Use a failure detector (timeout) to determine suspected
crash or unavailability of leader.
I On suspected leader crash, elect a new one.
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
I Use a failure detector (timeout) to determine suspected
crash or unavailability of leader.
I On suspected leader crash, elect a new one.
I Prevent two leaders at the same time (“split-brain”)!
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
I Use a failure detector (timeout) to determine suspected
crash or unavailability of leader.
I On suspected leader crash, elect a new one.
I Prevent two leaders at the same time (“split-brain”)!
Ensure ≤ 1 leader per term:
I Term is incremented every time a leader election is started
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
I Use a failure detector (timeout) to determine suspected
crash or unavailability of leader.
I On suspected leader crash, elect a new one.
I Prevent two leaders at the same time (“split-brain”)!
Ensure ≤ 1 leader per term:
I Term is incremented every time a leader election is started
I A node can only vote once per term
I Require a quorum of nodes to elect a leader in a term

A B C D E
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
I Use a failure detector (timeout) to determine suspected
crash or unavailability of leader.
I On suspected leader crash, elect a new one.
I Prevent two leaders at the same time (“split-brain”)!
Ensure ≤ 1 leader per term:
I Term is incremented every time a leader election is started
I A node can only vote once per term
I Require a quorum of nodes to elect a leader in a term

A B C D E

elects a leader
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
I Use a failure detector (timeout) to determine suspected
crash or unavailability of leader.
I On suspected leader crash, elect a new one.
I Prevent two leaders at the same time (“split-brain”)!
Ensure ≤ 1 leader per term:
I Term is incremented every time a leader election is started
I A node can only vote once per term
I Require a quorum of nodes to elect a leader in a term

A B C D E

elects a leader cannot elect a different leader


because C already voted
Can we guarantee there is only one leader?
Can guarantee unique leader per term.
Can we guarantee there is only one leader?
Can guarantee unique leader per term.

Cannot prevent having multiple leaders from different terms.


Can we guarantee there is only one leader?
Can guarantee unique leader per term.

Cannot prevent having multiple leaders from different terms.

Example: node 1 is leader in term t, but due to a network


partition it can no longer communicate with nodes 2 and 3:

node 1 node 2 node 3

Nodes 2 and 3 may elect a new leader in term t + 1.


Can we guarantee there is only one leader?
Can guarantee unique leader per term.

Cannot prevent having multiple leaders from different terms.

Example: node 1 is leader in term t, but due to a network


partition it can no longer communicate with nodes 2 and 3:

node 1 node 2 node 3

Nodes 2 and 3 may elect a new leader in term t + 1.

Node 1 may not even know that a new leader has been elected!
Checking if a leader has been voted out
For every decision (message to deliver), the leader must first
get acknowledgements from a quorum.

leader follower 1 follower 2


Checking if a leader has been voted out
For every decision (message to deliver), the leader must first
get acknowledgements from a quorum.

leader follower 1 follower 2


leader → followers: “Shall I be your leader in term t?”
Checking if a leader has been voted out
For every decision (message to deliver), the leader must first
get acknowledgements from a quorum.

leader follower 1 follower 2


leader → followers: “Shall I be your leader in term t?”
followers: “yes”
Checking if a leader has been voted out
For every decision (message to deliver), the leader must first
get acknowledgements from a quorum.

leader follower 1 follower 2


leader → followers: “Shall I be your leader in term t?”
followers: “yes”
leader → followers: “Can we deliver message m next in term t?”
Checking if a leader has been voted out
For every decision (message to deliver), the leader must first
get acknowledgements from a quorum.

leader follower 1 follower 2


leader → followers: “Shall I be your leader in term t?”
followers: “yes”
leader → followers: “Can we deliver message m next in term t?”
followers: “okay”
Checking if a leader has been voted out
For every decision (message to deliver), the leader must first
get acknowledgements from a quorum.

leader follower 1 follower 2


leader → followers: “Shall I be your leader in term t?”
followers: “yes”
leader → followers: “Can we deliver message m next in term t?”
followers: “okay”
leader → followers: “Right, now deliver m please”
Node state transitions in Raft

Follower Candidate Leader


Node state transitions in Raft

starts up
or recovers
from crash

Follower Candidate Leader


Node state transitions in Raft

starts up
or recovers
from crash
suspects
leader failure

Follower Candidate Leader


Node state transitions in Raft

starts up
or recovers
from crash
suspects
leader failure
receives votes
from quorum
Follower Candidate Leader
Node state transitions in Raft

starts up
or recovers
from crash
suspects
leader failure
receives votes
from quorum
Follower Candidate Leader
discovers
new term
Node state transitions in Raft

starts up
or recovers election
from crash times out
suspects
leader failure
receives votes
from quorum
Follower Candidate Leader
discovers
new term
Node state transitions in Raft

starts up
or recovers election
from crash times out
suspects
leader failure
receives votes
from quorum
Follower Candidate Leader
discovers
new term

discovers new term


Raft (1/9): initialisation
on initialisation do
    currentTerm := 0; votedFor := null
    log := ⟨⟩; commitLength := 0
    currentRole := follower; currentLeader := null
    votesReceived := {}; sentLength := ⟨⟩; ackedLength := ⟨⟩
end on

on recovery from crash do
    currentRole := follower; currentLeader := null
    votesReceived := {}; sentLength := ⟨⟩; ackedLength := ⟨⟩
end on

on node nodeId suspects leader has failed, or on election timeout do
    currentTerm := currentTerm + 1; currentRole := candidate
    votedFor := nodeId; votesReceived := {nodeId}; lastTerm := 0
    if log.length > 0 then lastTerm := log[log.length − 1].term; end if
    msg := (VoteRequest, nodeId, currentTerm, log.length, lastTerm)
    for each node ∈ nodes: send msg to node
    start election timer
end on
Raft (1/9): initialisation
(Log illustration: log = ⟨log[0], log[1], log[2]⟩ containing messages m1, m2, m3; each entry stores a msg and a term, here all in term 1.)

on initialisation do
    currentTerm := 0; votedFor := null
    log := ⟨⟩; commitLength := 0
    currentRole := follower; currentLeader := null
    votesReceived := {}; sentLength := ⟨⟩; ackedLength := ⟨⟩
end on

on recovery from crash do
    currentRole := follower; currentLeader := null
    votesReceived := {}; sentLength := ⟨⟩; ackedLength := ⟨⟩
end on

on node nodeId suspects leader has failed, or on election timeout do
    currentTerm := currentTerm + 1; currentRole := candidate
    votedFor := nodeId; votesReceived := {nodeId}; lastTerm := 0
    if log.length > 0 then lastTerm := log[log.length − 1].term; end if
    msg := (VoteRequest, nodeId, currentTerm, log.length, lastTerm)
    for each node ∈ nodes: send msg to node
    start election timer
end on
Raft (2/9): voting on a new leader
on receiving (VoteRequest, cId , cTerm, cLogLength, cLogTerm)
at node nodeId do
myLogTerm := log[log.length − 1].term
logOk := (cLogTerm > myLogTerm) ∨
(cLogTerm = myLogTerm ∧ cLogLength ≥ log.length)

termOk := (cTerm > currentTerm) ∨


(cTerm = currentTerm ∧ votedFor ∈ {cId , null})

if logOk ∧ termOk then


currentTerm := cTerm
currentRole := follower
votedFor := cId
send (VoteResponse, nodeId , currentTerm, true) to node cId
else
send (VoteResponse, nodeId , currentTerm, false) to node cId
end if
end on
Raft (2/9): voting on a new leader (variables prefixed with c refer to the candidate)

on receiving (VoteRequest, cId , cTerm, cLogLength, cLogTerm)


at node nodeId do
myLogTerm := log[log.length − 1].term
logOk := (cLogTerm > myLogTerm) ∨
(cLogTerm = myLogTerm ∧ cLogLength ≥ log.length)

termOk := (cTerm > currentTerm) ∨


(cTerm = currentTerm ∧ votedFor ∈ {cId , null})

if logOk ∧ termOk then


currentTerm := cTerm
currentRole := follower
votedFor := cId
send (VoteResponse, nodeId , currentTerm, true) to node cId
else
send (VoteResponse, nodeId , currentTerm, false) to node cId
end if
end on
Raft (3/9): collecting votes
on receiving (VoteResponse, voterId , term, granted ) at nodeId do
if currentRole = candidate ∧ term = currentTerm ∧ granted then
votesReceived := votesReceived ∪ {voterId }
if |votesReceived| ≥ ⌈(|nodes| + 1)/2⌉ then
currentRole := leader; currentLeader := nodeId
cancel election timer
for each follower ∈ nodes \ {nodeId } do
sentLength[follower ] := log.length
ackedLength[follower ] := 0
ReplicateLog(nodeId , follower )
end for
end if
else if term > currentTerm then
currentTerm := term
currentRole := follower
votedFor := null
cancel election timer
end if
end on
Raft (4/9): broadcasting messages
on request to broadcast msg at node nodeId do
if currentRole = leader then
append the record (msg : msg, term : currentTerm) to log
ackedLength[nodeId ] := log.length
for each follower ∈ nodes \ {nodeId } do
ReplicateLog(nodeId , follower )
end for
else
forward the request to currentLeader via a FIFO link
end if
end on

periodically at node nodeId do


if currentRole = leader then
for each follower ∈ nodes \ {nodeId } do
ReplicateLog(nodeId , follower )
end for
end if
end do
Raft (5/9): replicating from leader to followers
Called on the leader whenever there is a new message in the log, and also
periodically. If there are no new messages, entries is the empty list.
LogRequest messages with entries = ⟨⟩ serve as heartbeats, letting
followers know that the leader is still alive.

function ReplicateLog(leaderId , followerId )


i := sentLength[followerId ]
entries := ⟨log[i], log[i + 1], . . . , log[log.length − 1]⟩
prevLogTerm := 0
if i > 0 then
prevLogTerm := log[i − 1].term
end if
send (LogRequest, leaderId , currentTerm, i, prevLogTerm,
commitLength, entries) to followerId
end function
Raft (6/9): followers receiving messages
on receiving (LogRequest, leaderId , term, logLength, logTerm,
leaderCommit, entries) at node nodeId do
if term > currentTerm then
currentTerm := term; votedFor := null
end if
logOk := (log.length ≥ logLength)
if logOk ∧ (logLength > 0) then
logOk := (logTerm = log[logLength − 1].term)
end if

if term = currentTerm ∧ logOk then


currentRole := follower; currentLeader := leaderId
AppendEntries(logLength, leaderCommit, entries)
ack := logLength + entries.length
send (LogResponse, nodeId , currentTerm, ack , true) to leaderId
else
send (LogResponse, nodeId , currentTerm, 0, false) to leaderId
end if
end on
Raft (7/9): updating followers’ logs
function AppendEntries(logLength, leaderCommit, entries)
if entries.length > 0 ∧ log.length > logLength then
if log[logLength].term ≠ entries[0].term then
log := ⟨log[0], log[1], . . . , log[logLength − 1]⟩
end if
end if
if logLength + entries.length > log.length then
for i := log.length − logLength to entries.length − 1 do
append entries[i] to log
end for
end if
if leaderCommit > commitLength then
for i := commitLength to leaderCommit − 1 do
deliver log[i].msg to the application
end for
commitLength := leaderCommit
end if
end function
Raft (8/9): leader receiving log acknowledgements
on receiving (LogResponse, follower , term, ack , success) at nodeId do
if term = currentTerm ∧ currentRole = leader then
if success = true ∧ ack ≥ ackedLength[follower ] then
sentLength[follower ] := ack
ackedLength[follower ] := ack
CommitLogEntries()
else if sentLength[follower ] > 0 then
sentLength[follower ] := sentLength[follower ] − 1
ReplicateLog(nodeId , follower )
end if
else if term > currentTerm then
currentTerm := term
currentRole := follower
votedFor := null
end if
end on
Raft (9/9): leader committing log entries
Any log entries that have been acknowledged by a quorum of nodes are
ready to be committed by the leader. When a log entry is committed, its
message is delivered to the application.

define acks(length) = |{n ∈ nodes | ackedLength[n] ≥ length}|

function CommitLogEntries
minAcks := ⌈(|nodes| + 1)/2⌉
ready := {len ∈ {1, . . . , log.length} | acks(len) ≥ minAcks}
if ready ≠ {} ∧ max(ready) > commitLength ∧
log[max(ready) − 1].term = currentTerm then
for i := commitLength to max(ready) − 1 do
deliver log[i].msg to the application
end for
commitLength := max(ready)
end if
end function
Lecture 7

Replica consistency
“Consistency”
A word that means many different things in different contexts!
“Consistency”
A word that means many different things in different contexts!
I ACID: a transaction transforms the database from one
“consistent” state to another
“Consistency”
A word that means many different things in different contexts!
I ACID: a transaction transforms the database from one
“consistent” state to another
Here, “consistent” = satisfying application-specific
invariants
e.g. “every course with students enrolled must have at
least one lecturer”
“Consistency”
A word that means many different things in different contexts!
I ACID: a transaction transforms the database from one
“consistent” state to another
Here, “consistent” = satisfying application-specific
invariants
e.g. “every course with students enrolled must have at
least one lecturer”
I Read-after-write consistency (lecture 5)
“Consistency”
A word that means many different things in different contexts!
I ACID: a transaction transforms the database from one
“consistent” state to another
Here, “consistent” = satisfying application-specific
invariants
e.g. “every course with students enrolled must have at
least one lecturer”
I Read-after-write consistency (lecture 5)
I Replication: replica should be “consistent” with other
replicas
“Consistency”
A word that means many different things in different contexts!
I ACID: a transaction transforms the database from one
“consistent” state to another
Here, “consistent” = satisfying application-specific
invariants
e.g. “every course with students enrolled must have at
least one lecturer”
I Read-after-write consistency (lecture 5)
I Replication: replica should be “consistent” with other
replicas
“consistent” = in the same state? (when exactly?)
“consistent” = read operations return same result?
I Consistency model: many to choose from
Distributed transactions
Recall atomicity in the context of ACID transactions:
I A transaction either commits or aborts
Distributed transactions
Recall atomicity in the context of ACID transactions:
I A transaction either commits or aborts
I If it commits, its updates are durable
I If it aborts, it has no visible side-effects
Distributed transactions
Recall atomicity in the context of ACID transactions:
I A transaction either commits or aborts
I If it commits, its updates are durable
I If it aborts, it has no visible side-effects
I ACID consistency (preserving invariants) relies on
atomicity
Distributed transactions
Recall atomicity in the context of ACID transactions:
I A transaction either commits or aborts
I If it commits, its updates are durable
I If it aborts, it has no visible side-effects
I ACID consistency (preserving invariants) relies on
atomicity

If the transaction updates data on multiple nodes, this implies:

I Either all nodes must commit, or all must abort


Distributed transactions
Recall atomicity in the context of ACID transactions:
I A transaction either commits or aborts
I If it commits, its updates are durable
I If it aborts, it has no visible side-effects
I ACID consistency (preserving invariants) relies on
atomicity

If the transaction updates data on multiple nodes, this implies:

I Either all nodes must commit, or all must abort


I If any node crashes, all must abort

Ensuring this is the atomic commitment problem.


Looks a bit similar to consensus?
Atomic commit versus consensus

Consensus Atomic commit


One or more nodes propose Every node votes whether to
a value commit or abort
Atomic commit versus consensus

Consensus Atomic commit


One or more nodes propose Every node votes whether to
a value commit or abort
Any one of the proposed Must commit if all nodes
values is decided vote to commit; must abort
if ≥ 1 nodes vote to abort
Atomic commit versus consensus

Consensus Atomic commit


One or more nodes propose Every node votes whether to
a value commit or abort
Any one of the proposed Must commit if all nodes
values is decided vote to commit; must abort
if ≥ 1 nodes vote to abort
Crashed nodes can be Must abort if a participating
tolerated, as long as a node crashes
quorum is working
Two-phase commit (2PC)

client coordinator A B
Two-phase commit (2PC)

client coordinator A B
begin T1

T1

T1
Two-phase commit (2PC)

client coordinator A B
begin T1

T1
. . . usual transaction execution. . . T1
Two-phase commit (2PC)

client coordinator A B
begin T1

T1
. . . usual transaction execution. . . T1
commit T
1
Two-phase commit (2PC)

client coordinator A B
begin T1

T1
. . . usual transaction execution. . . T1
commit T
1
prepare
Two-phase commit (2PC)

client coordinator A B
begin T1

T1
. . . usual transaction execution. . . T1
commit T
1
prepare

ok ok
Two-phase commit (2PC)

client coordinator A B
begin T1

T1
. . . usual transaction execution. . . T1
commit T
1
prepare

ok ok

commit
Two-phase commit (2PC)

client coordinator A B
begin T1

T1
. . . usual transaction execution. . . T1
commit T
1
prepare

ok ok
decision whether
to commit or abort
commit
The coordinator in two-phase commit

What if the coordinator crashes?


The coordinator in two-phase commit

What if the coordinator crashes?


I Coordinator writes its decision to disk
I When it recovers, read decision from disk and send it to
replicas (or abort if no decision was made before crash)
The coordinator in two-phase commit

What if the coordinator crashes?


I Coordinator writes its decision to disk
I When it recovers, read decision from disk and send it to
replicas (or abort if no decision was made before crash)
I Problem: if coordinator crashes after prepare, but before
broadcasting decision, other nodes do not know how it
has decided
The coordinator in two-phase commit

What if the coordinator crashes?


I Coordinator writes its decision to disk
I When it recovers, read decision from disk and send it to
replicas (or abort if no decision was made before crash)
I Problem: if coordinator crashes after prepare, but before
broadcasting decision, other nodes do not know how it
has decided
I Replicas participating in transaction cannot commit or
abort after responding “ok” to the prepare request
(otherwise we risk violating atomicity)
The coordinator in two-phase commit

What if the coordinator crashes?


I Coordinator writes its decision to disk
I When it recovers, read decision from disk and send it to
replicas (or abort if no decision was made before crash)
I Problem: if coordinator crashes after prepare, but before
broadcasting decision, other nodes do not know how it
has decided
I Replicas participating in transaction cannot commit or
abort after responding “ok” to the prepare request
(otherwise we risk violating atomicity)
I Algorithm is blocked until coordinator recovers
Fault-tolerant two-phase commit (1/2)
on initialisation for transaction T do
commitVotes[T ] := {}; replicas[T ] := {}; decided [T ] := false
end on

on request to commit transaction T with participating nodes R do


for each r ∈ R do send (Prepare, T, R) to r
end on

on receiving (Prepare, T, R) at node replicaId do


replicas[T ] := R
ok = “is transaction T able to commit on this replica?”
total order broadcast (Vote, T, replicaId , ok ) to replicas[T ]
end on

on a node suspects node replicaId to have crashed do


for each transaction T in which replicaId participated do
total order broadcast (Vote, T, replicaId , false) to replicas[T ]
end for
end on
Fault-tolerant two-phase commit (2/2)

on delivering (Vote, T, replicaId , ok ) by total order broadcast do


if replicaId ∉ commitVotes[T ] ∧ replicaId ∈ replicas[T ] ∧
¬decided [T ] then
if ok = true then
commitVotes[T ] := commitVotes[T ] ∪ {replicaId }
if commitVotes[T ] = replicas[T ] then
decided [T ] := true
commit transaction T at this node
end if
else
decided [T ] := true
abort transaction T at this node
end if
end if
end on
Linearizability
Multiple nodes concurrently accessing replicated data.
How do we define “consistency” here?
Linearizability
Multiple nodes concurrently accessing replicated data.
How do we define “consistency” here?

The strongest option: linearizability


Linearizability
Multiple nodes concurrently accessing replicated data.
How do we define “consistency” here?

The strongest option: linearizability


I Informally: every operation takes effect atomically
sometime after it started and before it finished
Linearizability
Multiple nodes concurrently accessing replicated data.
How do we define “consistency” here?

The strongest option: linearizability


I Informally: every operation takes effect atomically
sometime after it started and before it finished
I All operations behave as if executed on a single copy of
the data (even if there are in fact multiple replicas)
Linearizability
Multiple nodes concurrently accessing replicated data.
How do we define “consistency” here?

The strongest option: linearizability


I Informally: every operation takes effect atomically
sometime after it started and before it finished
I All operations behave as if executed on a single copy of
the data (even if there are in fact multiple replicas)
I Consequence: every operation returns an “up-to-date”
value, a.k.a. “strong consistency”
Linearizability
Multiple nodes concurrently accessing replicated data.
How do we define “consistency” here?

The strongest option: linearizability


I Informally: every operation takes effect atomically
sometime after it started and before it finished
I All operations behave as if executed on a single copy of
the data (even if there are in fact multiple replicas)
I Consequence: every operation returns an “up-to-date”
value, a.k.a. “strong consistency”
I Not just in distributed systems, also in shared-memory
concurrency (memory on multi-core CPUs is not
linearizable by default!)
Linearizability
Multiple nodes concurrently accessing replicated data.
How do we define “consistency” here?

The strongest option: linearizability


I Informally: every operation takes effect atomically
sometime after it started and before it finished
I All operations behave as if executed on a single copy of
the data (even if there are in fact multiple replicas)
I Consequence: every operation returns an “up-to-date”
value, a.k.a. “strong consistency”
I Not just in distributed systems, also in shared-memory
concurrency (memory on multi-core CPUs is not
linearizable by default!)
Note: linearizability ≠ serializability!
Read-after-write consistency revisited

client A B C
Read-after-write consistency revisited

client A B C
(t1 , set(x, v
1 ))
set(x, v1 )
Read-after-write consistency revisited

client A B C
(t1 , set(x, v
1 ))
set(x, v1 )

ok ok
Read-after-write consistency revisited

client A B C
(t1 , set(x, v
1 ))
set(x, v1 )

ok ok

get(x)
get(x) → v1
Read-after-write consistency revisited

client A B C
(t1 , set(x, v
1 ))
set(x, v1 )

ok ok

get(x)
get(x) → v1

(t0, v0) (t1, v1)


From the client’s point of view
client 1 I Focus on client-observable
behaviour: when and what an
set(x, v1 )

operation returns
get(x) → v1
From the client’s point of view
client 1 I Focus on client-observable
behaviour: when and what an
set(x, v1 )

? operation returns
I Ignore how the replication
system is implemented internally
?
get(x) → v1

?
From the client’s point of view
client 1 I Focus on client-observable
behaviour: when and what an
set(x, v1 )

? operation returns
I Ignore how the replication
system is implemented internally
? I Did operation A finish before
operation B started?
real time
get(x) → v1

?
From the client’s point of view
client 1 client 2 I Focus on client-observable
behaviour: when and what an
set(x, v1 )

? operation returns
I Ignore how the replication
system is implemented internally
? I Did operation A finish before
real time operation B started?
I Even if the operations are on
get(x) → v1

?
different nodes?

?
From the client’s point of view
client 1 client 2 I Focus on client-observable
behaviour: when and what an
set(x, v1 )

? operation returns
I Ignore how the replication
system is implemented internally
? I Did operation A finish before
real time operation B started?
I Even if the operations are on
get(x) → v1

?
different nodes?
I This is not happens-before:
we want client 2 to read value
?
written by client 1, even if the
clients have not communicated!
Operations overlapping in time

client 1 client 2
I Client 2’s get operation
overlaps in time with
client 1’s set operation
set(x, v1 )

get(x) → v1
Operations overlapping in time

client 1 client 2
I Client 2’s get operation
overlaps in time with
client 1’s set operation
set(x, v1 )

I Maybe the set operation


takes effect first?
get(x) → v1
Operations overlapping in time

client 1 client 2
I Client 2’s get operation
overlaps in time with
client 1’s set operation
set(x, v1 )

I Maybe the set operation


takes effect first?
get(x) → v0
I Just as likely, the get
operation may be
executed first
Operations overlapping in time

client 1 client 2
I Client 2’s get operation
overlaps in time with
client 1’s set operation
set(x, v1 )

I Maybe the set operation


takes effect first?
get(x) → v0
I Just as likely, the get
operation may be
executed first
I Either outcome is fine in
this case
Not linearizable, despite quorum reads/writes

[Sequence diagram: client 1 sends (t1, set(x, v1)) to replicas A, B and C; only A acknowledges promptly. While the write is still in progress, client 2 performs a quorum read and receives (t1, v1) from A and (t0, v0) from B, so it returns get(x) → v1. Client 3 then performs a quorum read and receives (t0, v0) from both replicas it contacts, so it returns get(x) → v0. Only afterwards do the remaining acknowledgements of client 1’s write arrive.]
Not linearizable, despite quorum reads/writes

[Diagram: client 1 performs set(x, v1); client 2’s get(x) → v1 finishes before client 3’s get(x) → v0 starts, in real time.]

I Client 2’s operation finishes before client 3’s operation starts
I Linearizability therefore requires client 3’s operation to observe a state no older than client 2’s operation
I This example violates linearizability because v0 is older than v1
Making quorum reads/writes linearizable

[Sequence diagram: as before, client 1 sends (t1, set(x, v1)) to A, B and C, and only A acknowledges promptly. Client 2’s quorum read receives (t1, v1) from A and (t0, v0) from B; before returning get(x) → v1, client 2 writes (t1, set(x, v1)) back to a quorum of replicas and waits for their acknowledgements (read repair). Client 3’s subsequent quorum read therefore also returns get(x) → v1.]
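The read-repair step shown above can be sketched in a few lines of Python. The transport functions `read_from_quorum` and `write_to_quorum` are placeholders assumed for illustration; the point is only that the reader stores the freshest (timestamp, value) pair on a quorum before returning it.

def linearizable_quorum_get(key, read_from_quorum, write_to_quorum):
    """Quorum read with read repair (sketch).

    read_from_quorum(key) -> list of (timestamp, value) pairs, one per replica
    write_to_quorum(key, timestamp, value) -> blocks until a quorum acknowledges
    """
    responses = read_from_quorum(key)
    t, v = max(responses, key=lambda tv: tv[0])
    # Read repair: make sure the value we are about to return is stored on a
    # quorum, so any later quorum read sees this value or something newer.
    write_to_quorum(key, t, v)
    return v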
Linearizability for different types of operation
This ensures linearizability of get (quorum read) and set (blind write to quorum)
I When an operation finishes, the value read/written is stored on a quorum of replicas
I Every subsequent quorum operation will see that value
I Multiple concurrent writes may overwrite each other

What about an atomic compare-and-swap operation?
I CAS(x, oldValue, newValue) sets x to newValue iff current value of x is oldValue
I Previously discussed in shared memory concurrency
I Can we implement linearizable compare-and-swap in a distributed system?
I Yes: total order broadcast to the rescue again!
Linearizable compare-and-swap (CAS)
on request to perform get(x) do
    total order broadcast (get, x) and wait for delivery
end on

on request to perform CAS(x, old, new) do
    total order broadcast (CAS, x, old, new) and wait for delivery
end on

on delivering (get, x) by total order broadcast do
    return localState[x] as result of operation get(x)
end on

on delivering (CAS, x, old, new) by total order broadcast do
    success := false
    if localState[x] = old then
        localState[x] := new; success := true
    end if
    return success as result of operation CAS(x, old, new)
end on
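For concreteness, here is one possible way of rendering the pseudocode above in Python. It is only a sketch: the total-order broadcast object `tob`, its `broadcast` method and `on_deliver` callback registration, and the class name `ReplicatedRegister` are assumed names, not any particular library’s API. Requests block until the corresponding message is delivered back to the issuing replica.

import threading
import uuid

class ReplicatedRegister:
    """Sketch of linearizable get/CAS on top of total order broadcast.

    Assumption: tob.broadcast(msg) delivers every message to every replica in
    the same order, including back to the sender.
    """
    def __init__(self, tob):
        self.tob = tob
        self.state = {}        # key -> value (the replicated localState)
        self.pending = {}      # request id -> [event, result]
        tob.on_deliver(self.deliver)

    def _request(self, msg):
        req_id = uuid.uuid4().hex
        done = threading.Event()
        self.pending[req_id] = [done, None]
        self.tob.broadcast((req_id,) + msg)
        done.wait()            # wait until the message is delivered back to us
        return self.pending.pop(req_id)[1]

    def get(self, key):
        return self._request(("get", key))

    def cas(self, key, old, new):
        return self._request(("cas", key, old, new))

    def deliver(self, msg):
        # Executed in delivery order on every replica, so all replicas agree.
        req_id, op = msg[0], msg[1]
        if op == "get":
            result = self.state.get(msg[2])
        else:  # "cas"
            _, _, key, old, new = msg
            result = (self.state.get(key) == old)
            if result:
                self.state[key] = new
        if req_id in self.pending:  # only on the replica that issued the request
            self.pending[req_id][1] = result
            self.pending[req_id][0].set()

Because every replica applies the delivered operations in the same total order, the CAS outcome is the same everywhere, which is what makes the operation linearizable.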
Eventual consistency
Linearizability advantages:
I Makes a distributed system behave as if it were non-distributed
I Simple for applications to use

Downsides:
I Performance cost: lots of messages and waiting for responses
I Scalability limits: leader can be a bottleneck
I Availability problems: if you can’t contact a quorum of nodes, you can’t process any operations

Eventual consistency: a weaker model than linearizability. Different trade-off choices.
The CAP theorem
A system can be either strongly Consistent (linearizable) or Available in the presence of a network Partition.

[Diagram: node A performs set(x, v1) and replicates it to node B, but a network partition separates node C. Reads on A and B return get(x) → v1, while C still holds only the old value and can return only get(x) → v0.]

C must either wait indefinitely for the network to recover, or return a potentially stale value
Eventual consistency
Replicas process operations based only on their local state.
If there are no more updates, eventually all replicas will be in the same state. (No guarantees how long it might take.)

Strong eventual consistency:
I Eventual delivery: every update made to one non-faulty replica is eventually processed by every non-faulty replica.
I Convergence: any two replicas that have processed the same set of updates are in the same state (even if updates were processed in a different order).

Properties:
I Does not require waiting for network communication
I Causal broadcast (or weaker) can disseminate updates
I Concurrent updates ⇒ conflicts need to be resolved
Summary of minimum system model requirements

Problem                                                  | Must wait for communication | Requires synchrony
atomic commit                                            | all participating nodes     | partially synchronous
consensus, total order broadcast, linearizable CAS       | quorum                      | partially synchronous
linearizable get/set                                     | quorum                      | asynchronous
eventual consistency, causal broadcast, FIFO broadcast   | local replica only          | asynchronous

(rows ordered from the strongest assumptions at the top to the weakest at the bottom)
Lecture 8

Concurrency control in applications


Collaboration and conflict resolution
Nowadays we use a lot of collaboration software:
I Examples: calendar sync (last lecture), Google Docs, . . .
I Several users/devices working on a shared file/document
I Each user device has a local replica of the data
I Update the local replica anytime (even while offline), sync with others when the network is available
I Challenge: how to reconcile concurrent updates?

Families of algorithms:
I Conflict-free Replicated Data Types (CRDTs)
    I Operation-based
    I State-based
I Operational Transformation (OT)
Conflicts due to concurrent updates

[Diagram: node A and node B each hold a replica of the calendar entry
    { "title": "Lecture", "date": "2020-11-05", "time": "12:00" }
and are separated by a network partition. Node A sets title = "Lecture 1", giving
    { "title": "Lecture 1", "date": "2020-11-05", "time": "12:00" }
while node B concurrently sets time = "10:00", giving
    { "title": "Lecture", "date": "2020-11-05", "time": "10:00" }
When the nodes later sync, the updates are merged and both replicas converge to
    { "title": "Lecture 1", "date": "2020-11-05", "time": "10:00" } ]
Operation-based map CRDT
on initialisation do
    values := {}
end on

on request to read value for key k do
    if ∃t, v. (t, k, v) ∈ values then return v else return null
end on

on request to set key k to value v do
    t := newTimestamp()    ▷ globally unique, e.g. Lamport timestamp
    broadcast (set, t, k, v) by reliable broadcast (including to self)
end on

on delivering (set, t, k, v) by reliable broadcast do
    previous := {(t′, k′, v′) ∈ values | k′ = k}
    if previous = {} ∨ ∀(t′, k′, v′) ∈ previous. t′ < t then
        values := (values \ previous) ∪ {(t, k, v)}
    end if
end on
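A minimal Python rendering of this last-writer-wins map may help; it is a sketch, not the lecture’s reference implementation. The assumption is that a reliable-broadcast layer (not shown) delivers every message returned by `local_set` to every replica, in any order, and calls `apply_set` on delivery. Timestamps are (counter, node_id) pairs so that they are globally unique and totally ordered, in the spirit of Lamport timestamps.

class LWWMap:
    """Operation-based last-writer-wins map (sketch)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0
        self.values = {}   # key -> (timestamp, value)

    def local_set(self, key, value):
        self.counter += 1
        t = (self.counter, self.node_id)   # unique, totally ordered timestamp
        return ("set", t, key, value)      # message for the caller to broadcast

    def apply_set(self, msg):
        _, t, key, value = msg
        # Lamport-style: keep the local counter ahead of any timestamp seen.
        self.counter = max(self.counter, t[0])
        current = self.values.get(key)
        if current is None or current[0] < t:
            self.values[key] = (t, value)  # keep the entry with the larger timestamp

    def get(self, key):
        entry = self.values.get(key)
        return entry[1] if entry is not None else None

Because apply_set keeps only the entry with the larger timestamp for each key, applying the same set of messages in any order leaves every replica in the same state, which is the convergence property discussed on the next slide.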
Operation-based CRDTs
Reliable broadcast may deliver updates in any order:
I broadcast (set, t1, “title”, “Lecture 1”)
I broadcast (set, t2, “time”, “10:00”)

Recall strong eventual consistency:
I Eventual delivery: every update made to one non-faulty replica is eventually processed by every non-faulty replica.
I Convergence: any two replicas that have processed the same set of updates are in the same state

CRDT algorithm implements this:
I Reliable broadcast ensures every operation is eventually delivered to every (non-crashed) replica
I Applying an operation is commutative: order of delivery doesn’t matter
State-based map CRDT
The operator ⊔ merges two states s1 and s2 as follows:
s1 ⊔ s2 = {(t, k, v) ∈ (s1 ∪ s2) | ∄(t′, k′, v′) ∈ (s1 ∪ s2). k′ = k ∧ t′ > t}

on initialisation do
    values := {}
end on

on request to read value for key k do
    if ∃t, v. (t, k, v) ∈ values then return v else return null
end on

on request to set key k to value v do
    t := newTimestamp()    ▷ globally unique, e.g. Lamport timestamp
    values := {(t′, k′, v′) ∈ values | k′ ≠ k} ∪ {(t, k, v)}
    broadcast values by best-effort broadcast
end on

on delivering V by best-effort broadcast do
    values := values ⊔ V
end on
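The merge operator ⊔ translates directly into Python on the same representation (a set of (timestamp, key, value) triples). This is only an illustrative sketch, with the broadcast and anti-entropy machinery left out; the example data and the function name `merge` are assumptions for illustration.

def merge(s1, s2):
    """Keep, for each key, the triple with the greatest timestamp
    among all triples in either state."""
    union = s1 | s2
    return {(t, k, v) for (t, k, v) in union
            if not any(k2 == k and t2 > t for (t2, k2, v2) in union)}

# The operator is commutative, associative and idempotent:
a = {((1, "A"), "title", "Lecture 1")}
b = {((2, "B"), "title", "Lecture 2")}
c = {((1, "B"), "time", "10:00")}
assert merge(a, b) == merge(b, a)                       # commutative
assert merge(merge(a, b), c) == merge(a, merge(b, c))   # associative
assert merge(a, a) == a                                 # idempotent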
State-based CRDTs
Merge operator ⊔ must satisfy: ∀s1, s2, s3 . . .
I Commutative: s1 ⊔ s2 = s2 ⊔ s1.
I Associative: (s1 ⊔ s2) ⊔ s3 = s1 ⊔ (s2 ⊔ s3).
I Idempotent: s1 ⊔ s1 = s1.

State-based versus operation-based:
I Op-based CRDT typically has smaller messages
I State-based CRDT can tolerate message loss/duplication

Does not necessarily use broadcast:
I Can also merge concurrent updates to replicas e.g. in quorum replication, anti-entropy, . . .
Collaborative text editing: the problem

[Diagram: users A and B both start with the text “BC” (characters at indexes 0 and 1) while a network partition separates them. User A performs insert(0, “A”), producing “ABC”; concurrently user B performs insert(2, “D”), producing “BCD”. When the operations are exchanged, B applies (insert, 0, “A”) and obtains “ABCD”, but A applies (insert, 2, “D”) and obtains “ABDC”: the replicas diverge, because index 2 refers to a different position once A’s own insertion has been applied.]
Operational transformation

[Diagram: as before, user A performs insert(0, “A”) and user B concurrently performs insert(2, “D”). On receiving the remote operation, each user transforms it against their own concurrent operation:
    T((insert, 2, “D”), (insert, 0, “A”)) = (insert, 3, “D”)
    T((insert, 0, “A”), (insert, 2, “D”)) = (insert, 0, “A”)
Both users converge to “ABCD”.]
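As a toy illustration of the transformation function T used above, here is a sketch in Python handling only concurrent insertions. Real OT systems also transform deletions and need a deterministic tie-breaking rule; the node_id field and the tie-break used here are my own additions for illustration, not part of the slide’s definition.

def transform_insert(op, other):
    """Transform insertion `op` against a concurrent insertion `other`.

    Operations are ("insert", index, char, node_id); ties at the same index
    are broken by node_id so that both sides make the same decision.
    """
    _, i, c, n = op
    _, j, _, m = other
    if j < i or (j == i and m < n):
        return ("insert", i + 1, c, n)   # `other` lands before `op`: shift right
    return op

# The example from the slide: A inserts "A" at index 0, B inserts "D" at index 2.
a_op = ("insert", 0, "A", "A")
b_op = ("insert", 2, "D", "B")
assert transform_insert(b_op, a_op) == ("insert", 3, "D", "B")  # applied at user A
assert transform_insert(a_op, b_op) == ("insert", 0, "A", "A")  # applied at user B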
Text editing CRDT

[Diagram: every character is given a unique position number between 0.0 and 1.0; both users start with start/end markers at 0.0 and 1.0 enclosing “B” at position 0.5 and “C” at position 0.75. User A performs insert(0.25, “A”) while user B concurrently performs insert(0.875, “D”). After exchanging the operations (insert, 0.25, “A”) and (insert, 0.875, “D”), both users hold the same set of characters and, ordering them by position, both display “ABCD”.]
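The key operation in this scheme is choosing a new position strictly between two existing ones. A minimal sketch follows, using Python floats for brevity; a real implementation would use arbitrary-precision position identifiers, since repeated midpoints quickly exhaust float precision. The function name `position_between` is an assumed name.

def position_between(p_left, p_right):
    """Return a position strictly between two existing positions."""
    assert p_left < p_right
    return (p_left + p_right) / 2

# Reproducing the example: the text starts as [0.0, 0.5 ("B"), 0.75 ("C"), 1.0].
assert position_between(0.0, 0.5) == 0.25     # user A inserts "A" before "B"
assert position_between(0.75, 1.0) == 0.875   # user B inserts "D" after "C"
# Characters are always displayed in order of their positions, so both users
# converge to "ABCD" regardless of the order in which the inserts are delivered.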
Operation-based text CRDT (1/2)
function ElementAt(chars, index)
    min := the unique triple (p, n, v) ∈ chars such that
        ∄(p′, n′, v′) ∈ chars. p′ < p ∨ (p′ = p ∧ n′ < n)
    if index = 0 then return min
    else return ElementAt(chars \ {min}, index − 1)
end function

on initialisation do
    chars := {(0, null, ⊢), (1, null, ⊣)}    ▷ markers for the start and end of the text
end on

on request to read character at index index do
    let (p, n, v) := ElementAt(chars, index + 1); return v
end on

on request to insert character v at index index at node nodeId do
    let (p1, n1, v1) := ElementAt(chars, index)
    let (p2, n2, v2) := ElementAt(chars, index + 1)
    broadcast (insert, (p1 + p2)/2, nodeId, v) by causal broadcast
end on
Operation-based text CRDT (2/2)
on delivering (insert, p, n, v) by causal broadcast do
    chars := chars ∪ {(p, n, v)}
end on

on request to delete character at index index do
    let (p, n, v) := ElementAt(chars, index + 1)
    broadcast (delete, p, n) by causal broadcast
end on

on delivering (delete, p, n) by causal broadcast do
    chars := {(p′, n′, v′) ∈ chars | ¬(p′ = p ∧ n′ = n)}
end on

I Use causal broadcast so that insertion of a character is delivered before its deletion
I Insertion and deletion of different characters commute
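The two pseudocode slides above can be rendered as a small Python class. This is only a sketch under several assumptions: broadcast is abstracted away (the caller must deliver every returned message to every replica, in causal order), positions are plain floats, and the class name `TextCRDT` and the sentinel strings are my own choices. Characters are ordered by (position, node id), matching the (p, n, v) triples of the pseudocode.

class TextCRDT:
    """Operation-based text CRDT (sketch)."""
    def __init__(self, node_id):
        self.node_id = node_id
        # Sentinels marking the start and end of the text (never displayed).
        self.chars = {(0.0, None, "<start>"), (1.0, None, "<end>")}

    def _sorted(self):
        # Order characters by position, breaking ties by node id.
        return sorted(self.chars, key=lambda t: (t[0], str(t[1])))

    def read(self):
        return "".join(v for (p, n, v) in self._sorted()[1:-1])

    def insert(self, index, char):
        """Return the (insert, ...) message to broadcast by causal broadcast."""
        ordered = self._sorted()
        left, right = ordered[index], ordered[index + 1]
        position = (left[0] + right[0]) / 2
        return ("insert", position, self.node_id, char)

    def delete(self, index):
        """Return the (delete, ...) message to broadcast by causal broadcast."""
        p, n, v = self._sorted()[index + 1]
        return ("delete", p, n)

    def apply(self, msg):
        if msg[0] == "insert":
            _, p, n, v = msg
            self.chars.add((p, n, v))
        else:  # "delete"
            _, p, n = msg
            self.chars = {(p2, n2, v2) for (p2, n2, v2) in self.chars
                          if not (p2 == p and n2 == n)}

Applying the same set of messages, with causal delivery ensuring an insert arrives before the corresponding delete, leaves every replica with the same chars set and hence the same text.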
Google’s Spanner
A database system with millions of nodes, petabytes of data, distributed across datacenters worldwide

Consistency properties:
I Serializable transaction isolation
I Linearizable reads and writes
I Many shards, each holding a subset of the data; atomic commit of transactions across shards

Many standard techniques:
I State machine replication (Paxos) within a shard
I Two-phase locking for serializability
I Two-phase commit for cross-shard atomicity

The interesting bit: read-only transactions require no locks!


Consistent snapshots
A read-only transaction observes a consistent snapshot:
If T1 → T2 (e.g. T2 reads data written by T1). . .
I Snapshot reflecting writes by T2 also reflects writes by T1
I Snapshot that does not reflect writes by T1 does not reflect writes by T2 either
I In other words, snapshot is consistent with causality
I Even if read-only transaction runs for a long time

Approach: multi-version concurrency control (MVCC)
I Each read-write transaction Tw has commit timestamp tw
I Every value is tagged with timestamp tw of the transaction that wrote it (not overwriting the previous value)
I Read-only transaction Tr has snapshot timestamp tr
I Tr ignores values with tw > tr; observes most recent value with tw ≤ tr
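The MVCC read rule on this slide is simple enough to state as code. The sketch below assumes `versions` is a list of (commit_timestamp, value) pairs kept for each key, as described above; it is an illustration of the rule, not Spanner’s actual API.

def snapshot_read(versions, t_r):
    """Return the value a read-only transaction with snapshot timestamp t_r
    observes: the most recent version with commit timestamp t_w <= t_r."""
    visible = [(t_w, value) for (t_w, value) in versions if t_w <= t_r]
    if not visible:
        return None   # no version of this key existed at the snapshot time
    return max(visible, key=lambda tv: tv[0])[1]

# Example: versions written at t=3 and t=7; a snapshot at t_r=5 sees the t=3 value.
assert snapshot_read([(3, "old"), (7, "new")], 5) == "old"
assert snapshot_read([(3, "old"), (7, "new")], 9) == "new"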
Obtaining commit timestamps
Must ensure that whenever T1 → T2 we have t1 < t2.
I Physical clocks may be inconsistent with causality
I Can we use Lamport clocks instead?
I Problem: linearizability depends on real-time order, and logical clocks may not reflect this!

[Diagram: transaction T1 commits on node A; its results are shown to a user, whose subsequent action triggers transaction T2 on node B. The two transactions are related in real time through the user, but A and B never exchange a message, so logical clocks cannot capture the ordering.]
TrueTime: explicit physical clock uncertainty
Spanner’s TrueTime clock returns an interval [t_earliest, t_latest].
The true physical timestamp must lie within that range.
On commit, wait for the uncertainty δi = t_i,latest − t_i,earliest.

[Diagram: transaction T1 on node A obtains the interval [t1,earliest, t1,latest] when it requests to commit, and waits for δ1 before reporting the commit as done. Transaction T2 on node B starts after T1’s commit is done in real time, obtains [t2,earliest, t2,latest], and likewise waits for δ2. The commit wait ensures that the two uncertainty intervals do not overlap, so T1 receives a smaller commit timestamp than T2, as required.]
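A sketch of the commit-wait rule in Python. The function `truetime_now` is a stand-in I am assuming for TrueTime’s interface, returning an (earliest, latest) pair; following Spanner’s published design, the transaction takes the latest bound as its commit timestamp and then waits until the clock’s earliest bound has passed it.

import time

def commit_wait(truetime_now):
    """Assign a commit timestamp and perform TrueTime-style commit wait (sketch).

    truetime_now() -> (earliest, latest): the true physical time is guaranteed
    to lie within this interval (assumed interface, not a real API).
    """
    earliest, latest = truetime_now()
    commit_ts = latest                      # choose the latest bound as the timestamp
    # Wait until the true time has definitely passed commit_ts: once the
    # earliest bound exceeds it, any later transaction gets a larger timestamp.
    while truetime_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts

Because the transaction only reports “commit done” after this wait, any transaction that starts afterwards in real time obtains an interval lying entirely above commit_ts, giving t1 < t2.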
Determining clock uncertainty in TrueTime
Clock servers with atomic clock or GPS receiver in each datacenter; servers report their clock uncertainty.
Each node syncs its quartz clock with a server every 30 sec.
Between syncs, assume worst-case drift of 200 ppm.

[Plot: local clock uncertainty [ms] over time [s]. At each sync with a clock server the uncertainty drops to the server’s own uncertainty plus the round-trip time to the clock server; between syncs it grows linearly at the assumed worst-case drift rate.]
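The uncertainty bound sketched in the plot can be written down directly. The sketch below uses the slide’s figures (200 ppm worst-case drift, syncs every 30 seconds); treating the whole round-trip time as added uncertainty is a simplification I am assuming, and the function and parameter names are mine.

DRIFT_RATE = 200e-6   # assumed worst-case quartz drift: 200 ppm

def clock_uncertainty(server_uncertainty, round_trip_time, time_since_sync):
    """Upper bound on local clock error, as in the plot above: at each sync the
    uncertainty resets to the server's uncertainty plus the round-trip time to
    the clock server, then grows with the assumed drift until the next sync."""
    return server_uncertainty + round_trip_time + DRIFT_RATE * time_since_sync

# Example: 1 ms of server uncertainty plus round trip, 30 s since the last sync:
# 0.001 + 200e-6 * 30 = 0.007, i.e. about 7 ms just before the next sync.
print(clock_uncertainty(0.0005, 0.0005, 30.0))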


That’s all, folks!

Any questions? Email mk428@cst.cam.ac.uk!

Summary:
I Distributed systems are everywhere
I You use them every day: e.g. web apps
I Key goals: availability, scalability, performance
I Key problems: concurrency, faults, unbounded latency
I Key abstractions: replication, broadcast, consensus
I No one right way, just trade-offs
