Slides
Distributed Systems and Computer Networking
[Diagram: node i sends message m to node j over a network link]
This is hard.
I Physical communication: electric current, radio waves, laser, hard drives in a van...
Hard drives in a van?!
https://docs.aws.amazon.com/snowball/latest/ug/using-device.html
[Diagram: client sends a request message to a web server; the server sends back a response message, e.g. "<!DOCTYPE html>..."]
Client-server example: online payments
[Diagram: online shop sends a payment request to the payment service, which responds "success"]
Remote Procedure Call (RPC) example
if (result.isSuccess()) {
fulfilOrder();
}
[Diagram: online shop, RPC client, RPC server, payment service. The shop calls the processPayment() stub; the RPC client marshals the arguments into message m1 and sends it to the RPC server, while the shop waits. The RPC server unmarshals the arguments and invokes the processPayment() implementation in the payment service. The result is marshalled into message m2 and sent back; the RPC client unmarshals it and the stub function returns.]

m1 = {
  "request": "processPayment",
  "card": {
    "number": "1234567887654321",
    "expiryDate": "10/2024",
    "CVC": "123"
  },
  "amount": 3.99,
  "currency": "GBP"
}

m2 = {
  "result": "success",
  "id": "XP61hHw2Rvo"
}
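For context, a self-contained sketch of how such an RPC stub presents a remote call as a local method call. All names here (PaymentService, PaymentServiceStub, Result) are assumptions for illustration, and the stub's network interaction is faked rather than implemented:

// A minimal sketch (not the lecture's actual code) of the stub idea behind RPC.
public class RpcSketch {
    interface PaymentService {
        Result processPayment(String cardNumber, double amount, String currency);
    }

    record Result(boolean success, String id) {
        boolean isSuccess() { return success; }
    }

    // The stub has the same interface as a local implementation, but a real stub would
    // marshal the arguments into a message m1, send it to the RPC server, wait for the
    // response m2, and unmarshal it. Here the remote call is faked for illustration.
    static class PaymentServiceStub implements PaymentService {
        public Result processPayment(String cardNumber, double amount, String currency) {
            return new Result(true, "XP61hHw2Rvo");
        }
    }

    public static void main(String[] args) {
        PaymentService paymentService = new PaymentServiceStub();
        Result result = paymentService.processPayment("1234567887654321", 3.99, "GBP");
        if (result.isSuccess()) {
            System.out.println("fulfilOrder(): payment id " + result.id());
        }
    }
}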
Remote Procedure Call (RPC)
“Location transparency”:
system hides where a resource is located.
In practice...
I what if the service crashes during the function call?
I what if a message is lost?
I what if a message is delayed?
I if something goes wrong, is it safe to retry?
RPC history
fetch('https://example.com/payments', request)
.then((response) => {
if (response.ok) success(response.json());
else failure(response.status); // server error
})
.catch((error) => {
failure(error); // network error
});
RPC in enterprise systems
message PaymentStatus {
required bool success = 1;
optional string errorMessage = 2;
}
service PaymentService {
rpc ProcessPayment(PaymentRequest) returns (PaymentStatus) {}
}
Lecture 2
The two generals problem
[Diagram: army 1 and army 2, on opposite sides of a city, must decide whether to attack. General 1 sends a messenger with "attack 10 Nov, okay?"; general 2 replies "10 Nov agreed!", but messengers may be captured and messages lost.]
How should the generals decide?
1. General 1 always attacks, even if no response is received?
I Send lots of messengers to increase probability that one will get through
I If all are captured, general 2 does not know about the attack, so general 1 loses
[Diagram: analogy between the two generals and the online shop. The two armies correspond to the online shop and the payment service; messengers correspond to RPC messages; "attack?" corresponds to "agree (to the order)?". The shop must decide whether to dispatch the goods to the customer; the payment service must decide whether to charge the credit card.]
[Map of the Byzantine empire by 650 AD. Source: https://commons.wikimedia.org/wiki/File:Byzantiumby650AD.svg]
I Network: reliable, fair-loss, or arbitrary
I Nodes: crash-stop, crash-recovery, or Byzantine
I Timing: synchronous, partially synchronous, or asynchronous
Fault tolerance:
system as a whole continues working, despite faults
(some maximum number of faults assumed)
Problem: cannot tell the difference between a crashed node, a temporarily unresponsive node, a lost message, and a delayed message
Failure detection in partially synchronous systems
What happened?
Clocks and time in distributed systems
Distributed systems often need to measure time, e.g.:
I Schedulers, timeouts, failure detectors, retry timers
I Performance measurements, statistics, profiling
I Log files & databases: record when an event occurred
I Data with time-limited validity (e.g. cache entries)
I Determining order of events across several nodes
Quartz clocks
I Quartz crystal laser-trimmed to mechanically resonate at a specific frequency
I Piezoelectric effect: mechanical force ⇔ electric field
I Oscillator circuit produces signal at resonant frequency
I Count number of cycles to measure elapsed time
Quartz clock error: drift
I One clock runs slightly fast, another slightly slow
I Drift measured in parts per million (ppm)
I 1 ppm = 1 microsecond/second = 86 ms/day = 32 s/year
I Most computer clocks correct within ≈ 50 ppm
Temperature significantly affects drift
Atomic clocks
I Caesium-133 has a resonance (“hyperfine transition”) at ≈ 9 GHz
I Tune an electronic oscillator to that resonant frequency
I 1 second = 9,192,631,770 periods of that signal
I Accuracy ≈ 1 in 10^14 (1 second in 3 million years)
I Price ≈ £20,000 (?) (can get cheaper rubidium clocks for ≈ £1,000)
https://www.microsemi.com/product-directory/cesium-frequency-references/4115-5071a-cesium-primary-frequency-standard
GPS as time source
http://leapsecond.com/notes/leap-watch.htm
How computers represent timestamps

How most software deals with leap seconds
By ignoring them!
[Photo: https://www.flickr.com/photos/ru_boff/37915499055/]
Estimating time over a network
[Diagram: NTP client and NTP server. The client sends a request at time t1 (by its own clock); the server receives it at t2 and sends a response containing (t1, t2, t3) at time t3 (by its clock); the client receives the response at t4.]

Round-trip network delay: δ = (t4 − t1) − (t3 − t2)

Estimated server time when client receives response: t3 + δ/2

Estimated clock skew: θ = t3 + δ/2 − t4 = (t2 − t1 + t3 − t4)/2
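A small worked example of the calculation above (the timestamp values are made up for illustration):

// Self-contained sketch of the NTP skew estimate, all timestamps in milliseconds.
public class NtpSkewExample {
    public static void main(String[] args) {
        long t1 = 100_000; // client sends request (client clock)
        long t2 = 100_070; // server receives request (server clock)
        long t3 = 100_072; // server sends response (server clock)
        long t4 = 100_020; // client receives response (client clock)

        long delta = (t4 - t1) - (t3 - t2);           // round-trip network delay = 18 ms
        double theta = ((t2 - t1) + (t3 - t4)) / 2.0; // estimated clock skew = 61 ms

        System.out.println("round-trip delay delta = " + delta + " ms");
        System.out.println("estimated skew theta   = " + theta + " ms");
        System.out.println("estimated server time at t4 = " + (t3 + delta / 2.0));
    }
}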
Correcting clock skew
Once the client has estimated the clock skew θ, it needs to
apply that correction to its clock.
Monotonic and time-of-day clocks

// BAD:
long startTime = System.currentTimeMillis();
doSomething();   // if the NTP client steps the clock during this...
long endTime = System.currentTimeMillis();
long elapsedMillis = endTime - startTime;
// ...elapsedMillis may be negative!

// GOOD:
long startTime = System.nanoTime();
doSomething();
long endTime = System.nanoTime();
long elapsedNanos = endTime - startTime;
// elapsedNanos is always >= 0
Time-of-day clock:
I Time since a fixed date (e.g. 1 January 1970 epoch)
Monotonic clock:
I Time since arbitrary point (e.g. when machine booted up)
Monotonic and time-of-day clocks
Time-of-day clock:
I Time since a fixed date (e.g. 1 January 1970 epoch)
I May suddenly move forwards or backwards (NTP
stepping), subject to leap second adjustments
Monotonic clock:
I Time since arbitrary point (e.g. when machine booted up)
I Always moves forwards at near-constant rate
Monotonic and time-of-day clocks
Time-of-day clock:
I Time since a fixed date (e.g. 1 January 1970 epoch)
I May suddenly move forwards or backwards (NTP
stepping), subject to leap second adjustments
I Timestamps can be compared across nodes (if synced)
Monotonic clock:
I Time since arbitrary point (e.g. when machine booted up)
I Always moves forwards at near-constant rate
I Good for measuring elapsed time on a single node
Monotonic and time-of-day clocks
Time-of-day clock:
I Time since a fixed date (e.g. 1 January 1970 epoch)
I May suddenly move forwards or backwards (NTP stepping), subject to leap second adjustments
I Timestamps can be compared across nodes (if synced)
I Java: System.currentTimeMillis()
I Linux: clock_gettime(CLOCK_REALTIME)

Monotonic clock:
I Time since arbitrary point (e.g. when machine booted up)
I Always moves forwards at near-constant rate
I Good for measuring elapsed time on a single node
I Java: System.nanoTime()
I Linux: clock_gettime(CLOCK_MONOTONIC)
Ordering of messages

Happens-before relation example
[Diagram: events a, b on node A; c, d on node B; e, f on node C. Node A sends message m1 (sent at event b, received at event c on B); node B sends message m2 (sent at event d, received at event f on C).]
Causality
[Diagram: spacetime diagram with a "time" axis and a "distance in space" axis, showing events a and b and the light cone ("light from a") of events that a could influence.]
Taken from physics (relativity).
I When a → b, then a might have caused b.
I When a ∥ b, we know that a cannot have caused b.
Happens-before relation encodes potential causality.
Lamport timestamps example
[Diagram: nodes A, B, C. A's events get timestamps (1, A), (2, A), (3, A); C's first event gets (1, C). A attaches timestamp 2 to message m1; on receiving it, B's events get (3, B) and (4, B). B attaches timestamp 4 to message m2; on receiving it, C's next event gets (5, C).]
Given Lamport timestamps L(a) and L(b) with L(a) < L(b), we can’t tell whether a → b or a ∥ b.
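A minimal sketch of a Lamport clock implementing the scheme above (class and method names are my own, not from the lecture):

public class LamportClock {
    private long counter = 0;
    private final String nodeName;

    public LamportClock(String nodeName) { this.nodeName = nodeName; }

    // On a local event or before sending a message: increment and use the counter.
    public synchronized long tick() { return ++counter; }

    // On receiving a message carrying timestamp t: jump past it.
    public synchronized long onReceive(long t) {
        counter = Math.max(counter, t) + 1;
        return counter;
    }

    // The full timestamp is (counter, nodeName); ties on counter are broken by node name.
    @Override public synchronized String toString() { return "(" + counter + ", " + nodeName + ")"; }

    public static void main(String[] args) {
        LamportClock a = new LamportClock("A");
        LamportClock b = new LamportClock("B");
        long sendTimestamp = a.tick();   // A's send event is stamped (1, A)
        b.onReceive(sendTimestamp);      // B's receive event is stamped (2, B)
        System.out.println("A is now at " + a + ", B is now at " + b);
    }
}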
Vector clocks example
Assuming the vector of nodes is N = ⟨A, B, C⟩:
[Diagram: node A's events have vector timestamps ⟨1,0,0⟩, ⟨2,0,0⟩, ⟨3,0,0⟩; node B's events ⟨2,1,0⟩, ⟨2,2,0⟩; node C's events ⟨0,0,1⟩, ⟨2,2,2⟩. A attaches ⟨2,0,0⟩ to message m1 sent to B; B attaches ⟨2,2,0⟩ to message m2 sent to C.]
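A minimal sketch of a vector clock for the fixed vector of nodes ⟨A, B, C⟩ used above (names and structure are my own):

import java.util.Arrays;

public class VectorClock {
    private final long[] v;        // one entry per node, indices 0 = A, 1 = B, 2 = C
    private final int myIndex;

    public VectorClock(int numNodes, int myIndex) {
        this.v = new long[numNodes];
        this.myIndex = myIndex;
    }

    // Local event or message send: increment our own entry and attach a copy.
    public synchronized long[] tick() {
        v[myIndex]++;
        return v.clone();
    }

    // Message receive: element-wise maximum with the sender's timestamp, then increment.
    public synchronized long[] onReceive(long[] msgTimestamp) {
        for (int i = 0; i < v.length; i++) v[i] = Math.max(v[i], msgTimestamp[i]);
        v[myIndex]++;
        return v.clone();
    }

    public static void main(String[] args) {
        VectorClock a = new VectorClock(3, 0), b = new VectorClock(3, 1);
        a.tick();                       // A's first event: <1,0,0>
        long[] m1 = a.tick();           // A's second event, attached to m1: <2,0,0>
        long[] atB = b.onReceive(m1);   // B's receive event: <2,1,0>
        System.out.println(Arrays.toString(m1) + " -> " + Arrays.toString(atB));
    }
}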
Receiving versus delivering
[Diagram: on each of node A and node B, the application sits on top of a broadcast algorithm, which sits on top of the network. The application on A broadcasts a message; the broadcast algorithm on B receives it from the network and delivers it to B's application.]
Receiving a message from the network and delivering it to the application are separate steps: the broadcast algorithm may need to hold back a received message before delivering it.
Forms of reliable broadcast

FIFO broadcast:
If m1 and m2 are broadcast by the same node, and broadcast(m1) → broadcast(m2), then m1 must be delivered before m2.

Causal broadcast:
If broadcast(m1) → broadcast(m2) then m1 must be delivered before m2.
FIFO broadcast example
[Diagram: nodes A, B, C. A broadcasts m1 and later m3; B broadcasts m2. Every node must deliver m1 before m3 (same sender); m2 may be delivered at any point relative to them.]
Causal broadcast example
[Diagram: nodes A, B, C. A broadcasts m1; B broadcasts m2 after delivering m1; A broadcasts m3. Every node must deliver m1 before m2 and before m3; m2 and m3 are concurrent, so they may be delivered in either order.]
Total order broadcast example
[Diagram: nodes A, B, C broadcasting m1, m2, m3. Every node must deliver the messages in the same order (e.g. m1, m2, m3 on every node).]
[Diagram: relationships between broadcast models (arrow = stronger than): total order broadcast, causal broadcast, FIFO broadcast, reliable broadcast, best-effort broadcast.]
Broadcast algorithms
Break down into two layers:
1. Make best-effort broadcast reliable by retransmitting
dropped messages
2. Enforce delivery order on top of reliable broadcast
Eager reliable broadcast
Idea: the first time a node receives a particular message, it re-broadcasts it to each other node (via reliable links).
[Diagram: node A broadcasts m1 to B and C; when B and C first receive m1, each re-broadcasts it to every other node, so m1 still reaches all nodes even if some of the original transmissions are lost.]
on initialisation do
    sendSeq := 0; delivered := ⟨0, 0, . . . , 0⟩; buffer := {}
end on
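Assuming the fragment above is the initialisation of a FIFO broadcast algorithm (sendSeq numbering our own broadcasts, delivered counting per-sender deliveries, buffer holding out-of-order messages), a self-contained sketch of the delivery logic might look like this (the transport layer is not shown):

import java.util.*;

public class FifoBroadcastSketch {
    record Msg(int sender, long seq, String body) {}

    private final int myId;
    private long sendSeq = 0;                      // sequence number for our own broadcasts
    private final long[] delivered;                // delivered[i] = messages delivered from node i
    private final Set<Msg> buffer = new HashSet<>(); // messages received but not yet deliverable

    public FifoBroadcastSketch(int myId, int numNodes) {
        this.myId = myId;
        this.delivered = new long[numNodes];
    }

    // Broadcast: tag the message with our own sequence number (sending not shown).
    public Msg broadcast(String body) {
        return new Msg(myId, sendSeq++, body);
    }

    // On receiving a message via the underlying reliable broadcast: buffer it, then
    // deliver any buffered messages that are next in their sender's sequence.
    public List<String> onReceive(Msg m) {
        buffer.add(m);
        List<String> deliveredNow = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Iterator<Msg> it = buffer.iterator(); it.hasNext(); ) {
                Msg cand = it.next();
                if (cand.seq() == delivered[cand.sender()]) {
                    delivered[cand.sender()]++;
                    deliveredNow.add(cand.body());
                    it.remove();
                    progress = true;
                }
            }
        }
        return deliveredNow;
    }

    public static void main(String[] args) {
        FifoBroadcastSketch sender = new FifoBroadcastSketch(0, 2);
        FifoBroadcastSketch receiver = new FifoBroadcastSketch(1, 2);
        Msg m1 = sender.broadcast("m1");
        Msg m2 = sender.broadcast("m2");
        System.out.println(receiver.onReceive(m2)); // []       (m2 buffered, waiting for m1)
        System.out.println(receiver.onReceive(m1)); // [m1, m2] (delivered in sender order)
    }
}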
Replication
I Keeping a copy of the same data on multiple nodes
I Databases, filesystems, caches, . . .
I A node that has a copy of the data is called a replica
I If some replicas are faulty, others are still accessible
I Spread load across many replicas
I Easy if the data doesn’t change: just copy it
I We will focus on data changes
Retrying state updates
[Diagram: User A's post "The moon is not actually made of cheese!" shows "12,300 people like this". A client sends "increment post.likes" to the database and gets back the acknowledgement 12,301; because the acknowledgement is lost or delayed, the client retries, and the counter ends up at 12,302 after a single like.]
Adding and then removing again
[Diagram: client 1 sends "f: add like" to the set of likes; client 2 then sends "g: unlike", which is acknowledged; client 1's "f: add like" is applied again afterwards (e.g. due to a retry), so the like reappears even though it was removed in between.]
Another problem with adding and removing
[Diagram: a client sends add(x) to replicas A and B (both apply it), then sends remove(x) to both, but the remove only takes effect at A.]
The final state (x ∉ A, x ∈ B) is the same as in this case:
[Diagram: the client sends add(x) to both replicas, but only B receives it.]
Timestamps and tombstones
[Diagram: a client sends (t1, add(x)) to replicas A and B; each replica records {x ↦ (t1, true)}. The client then sends (t2, remove(x)); A records the tombstone {x ↦ (t2, false)}, while B does not receive the remove and still has {x ↦ (t1, true)}.]
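A minimal sketch of the timestamped map with tombstones shown above, including last-writer-wins conflict resolution and an anti-entropy merge (class and method names are my own):

import java.util.HashMap;
import java.util.Map;

public class LwwSetSketch<K> {
    record Entry(long timestamp, boolean present) {}   // present = false is a tombstone

    private final Map<K, Entry> state = new HashMap<>();

    public void add(long timestamp, K key)    { apply(key, new Entry(timestamp, true)); }
    public void remove(long timestamp, K key) { apply(key, new Entry(timestamp, false)); }

    private void apply(K key, Entry e) {
        Entry old = state.get(key);
        if (old == null || old.timestamp() < e.timestamp()) {
            state.put(key, e);   // keep the entry with the newer timestamp (last writer wins)
        }
    }

    public boolean contains(K key) {
        Entry e = state.get(key);
        return e != null && e.present();
    }

    // Anti-entropy: merge another replica's state into ours; newer timestamps win.
    public void reconcile(LwwSetSketch<K> other) {
        other.state.forEach(this::apply);
    }

    public static void main(String[] args) {
        LwwSetSketch<String> a = new LwwSetSketch<>(), b = new LwwSetSketch<>();
        a.add(1, "x"); b.add(1, "x");   // (t1, add(x)) reaches both replicas
        a.remove(2, "x");               // (t2, remove(x)) only reaches A
        a.reconcile(b); b.reconcile(a); // anti-entropy exchange
        System.out.println(a.contains("x") + " " + b.contains("x")); // false false
    }
}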
Reconciling replicas
Replicas periodically communicate among themselves to check for any inconsistencies.
[Diagram: replica A, with state {x ↦ (t2, false)}, and replica B, with state {x ↦ (t1, true)}, reconcile their state (anti-entropy); the entry with the newer timestamp, (t2, false), wins.]
Concurrent writes by different clients
[Diagram: client 1 sends (t1, set(x, v1)) to replica A while, concurrently, client 2 sends (t2, set(x, v2)) to replica B.]
Read-after-write consistency
[Diagram: a client sends (t1, set(x, v1)) to replica A and gets an acknowledgement, then sends get(x) to replica B, which still returns the old value (t0, v0): writing to one replica and reading from another can violate read-after-write consistency.]
[Diagram: five replicas A, B, C, D, E, illustrating overlapping read and write quorums.]
Read repair
[Diagram: a client sends get(x) to replicas A, B, C; on noticing that some replicas returned an older value, it sends the newest value (t1, set(x, v1)) back to the out-of-date replica(s).]
Database leader replica
Leader database replica L ensures total order broadcast.
[Diagram: clients 1 and 2 send transactions T1 and T2 to the leader L; L chooses the order, replicates T1 and then T2 to the follower F, commits, and acknowledges each client ("ok").]
Consensus
Fault-tolerant total order broadcast
Total order broadcast is very useful for state machine
replication.
Can implement total order broadcast by sending all messages
via a single leader.
Problem: what if leader crashes/becomes unavailable?
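To make the connection to state machine replication concrete, here is a minimal, self-contained sketch (the class, the "set key value" message format and the key-value state machine are assumptions for illustration; the total order broadcast itself is not implemented, its delivery order is just simulated):

import java.util.*;

public class StateMachineReplicationSketch {
    private final Map<String, String> kvStore = new HashMap<>(); // the replicated state

    // Called by the total order broadcast layer when a message is delivered.
    public void deliver(String message) {
        String[] parts = message.split(" ");                     // assumed format: "set key value"
        if (parts[0].equals("set")) kvStore.put(parts[1], parts[2]);
    }

    public static void main(String[] args) {
        List<String> deliveryOrder = List.of("set x 1", "set x 2", "set y 3");
        StateMachineReplicationSketch replicaA = new StateMachineReplicationSketch();
        StateMachineReplicationSketch replicaB = new StateMachineReplicationSketch();
        deliveryOrder.forEach(replicaA::deliver);  // same messages, same order on every replica...
        deliveryOrder.forEach(replicaB::deliver);
        System.out.println(replicaA.kvStore.equals(replicaB.kvStore)); // ...so the same final state
    }
}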
Leader election
Multi-Paxos, Raft, etc. use a leader to sequence messages.
I Use a failure detector (timeout) to determine suspected crash or unavailability of leader.
I On suspected leader crash, elect a new one.
I Prevent two leaders at the same time (“split-brain”)!
Ensure ≤ 1 leader per term:
I Term is incremented every time a leader election is started
I A node can only vote once per term
I Require a quorum of nodes to elect a leader in a term
[Diagram: five nodes A, B, C, D, E; a quorum of them elects a new leader. The previous leader (node 1) may not even know that a new leader has been elected!]
Checking if a leader has been voted out
For every decision (message to deliver), the leader must first get acknowledgements from a quorum.
[Diagram: the leader asks the other nodes "Can we deliver message m next in term t?"; after a quorum replies "okay", it tells them "Right, now deliver m please".]
Node state transitions in Raft
[State diagram: a node starts up (or recovers from a crash) in the Follower state. If it suspects leader failure, it becomes a Candidate; if the election times out, it starts another election; if it receives votes from a quorum, it becomes Leader. A Candidate or Leader that discovers a new term becomes a Follower again.]
function CommitLogEntries
    minAcks := ⌈(|nodes| + 1)/2⌉
    ready := {len ∈ {1, . . . , log.length} | acks(len) ≥ minAcks}
    if ready ≠ {} ∧ max(ready) > commitLength ∧
            log[max(ready) − 1].term = currentTerm then
        for i := commitLength to max(ready) − 1 do
            deliver log[i].msg to the application
        end for
        commitLength := max(ready)
    end if
end function
Lecture 7
Replica consistency
“Consistency”
A word that means many different things in different contexts!
I ACID: a transaction transforms the database from one “consistent” state to another.
  Here, “consistent” = satisfying application-specific invariants,
  e.g. “every course with students enrolled must have at least one lecturer”
I Read-after-write consistency (lecture 5)
I Replication: replica should be “consistent” with other replicas
  “consistent” = in the same state? (when exactly?)
  “consistent” = read operations return same result?
I Consistency model: many to choose from
Distributed transactions
Recall atomicity in the context of ACID transactions:
I A transaction either commits or aborts
I If it commits, its updates are durable
I If it aborts, it has no visible side-effects
I ACID consistency (preserving invariants) relies on atomicity
Two-phase commit (2PC)
[Diagram: client, coordinator, and participating replicas A and B. The client begins transaction T1, which executes as usual on A and B. When the client asks to commit T1, the coordinator sends "prepare" to A and B; once both reply "ok", the coordinator makes the decision whether to commit or abort, and then sends "commit" (or "abort") to A and B.]
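To illustrate the protocol above, a minimal sketch of the coordinator's commit/abort decision rule (the class, the Vote/Decision types and the in-process "prepare" callback are assumptions for illustration; a real coordinator communicates over the network, uses timeouts, and durably logs its decision):

import java.util.List;
import java.util.function.Function;

public class TwoPhaseCommitSketch {
    enum Vote { OK, NO }
    enum Decision { COMMIT, ABORT }

    static Decision coordinate(List<String> participants, Function<String, Vote> prepare) {
        // Phase 1: ask every participant to prepare (a timeout would count as a NO vote).
        for (String p : participants) {
            if (prepare.apply(p) != Vote.OK) return Decision.ABORT;
        }
        // Phase 2: everyone said ok, so the decision is commit (sent to all participants).
        return Decision.COMMIT;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("A", "B");
        System.out.println(coordinate(nodes, p -> Vote.OK));                           // COMMIT
        System.out.println(coordinate(nodes, p -> p.equals("B") ? Vote.NO : Vote.OK)); // ABORT
    }
}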
The coordinator in two-phase commit
Read-after-write consistency revisited
[Diagram: a client sends (t1, set(x, v1)) to replicas A, B, C and receives acknowledgements from a quorum; it then sends get(x) to a quorum of replicas. Because the read quorum overlaps with the write quorum, at least one replica returns v1, and the operation returns get(x) → v1.]
From the client’s point of view
[Diagram: client 1 performs set(x, v1); later in real time, client 2 performs get(x) → v1.]
I Focus on client-observable behaviour: when and what an operation returns
I Ignore how the replication system is implemented internally
I Did operation A finish before operation B started?
I Even if the operations are on different nodes?
I This is not happens-before: we want client 2 to read the value written by client 1, even if the clients have not communicated!
Operations overlapping in time
I Client 2’s get operation overlaps in time with client 1’s set operation
[Diagram: client 1’s set(x, v1) and client 2’s get(x) overlap in real time.]
Not linearizable, despite quorum reads/writes
[Diagram: client 1 performs set(x, v1) using a quorum write; while the write is still in progress, client 2 performs a quorum read get(x), sees (t1, v1) on one replica and (t0, v0) on the others, and returns v1. Afterwards client 3 performs a quorum read that happens to reach only replicas still holding (t0, v0), and returns v0.]
I Client 2’s operation finishes before client 3’s operation starts
I Linearizability therefore requires client 3’s operation to observe a state no older than client 2’s operation
I This example violates linearizability because v0 is older than v1
Making quorum reads/writes linearizable
[Diagram: as before, client 1 writes (t1, set(x, v1)) to a quorum. When client 2’s get(x) observes (t1, v1) on some replicas and (t0, v0) on others, it first writes (t1, set(x, v1)) back to the out-of-date replicas and waits for acknowledgements from a quorum before returning get(x) → v1. Any subsequent quorum read (e.g. by client 3) is then guaranteed to observe v1.]
Linearizability for different types of operation
This ensures linearizability of get (quorum read) and set (blind write to quorum):
I When an operation finishes, the value read/written is stored on a quorum of replicas
I Every subsequent quorum operation will see that value
I Multiple concurrent writes may overwrite each other
Eventual consistency
Linearizability advantages:
I Makes a distributed system behave as if it were non-distributed
I Simple for applications to use
Downsides:
I Performance cost: lots of messages and waiting for responses
I Scalability limits: leader can be a bottleneck
I Availability problems: if you can’t contact a quorum of nodes, you can’t process any operations
[Diagram: nodes A and B are separated from node C by a network partition. A client performs set(x, v1) on A; reads on A and B return v1, but a read on C still returns the old value v0.]
C must either wait indefinitely for the network to recover, or return a potentially stale value.
Eventual consistency
Replicas process operations based only on their local state.
If there are no more updates, eventually all replicas will be in
the same state. (No guarantees how long it might take.)
Strong eventual consistency:
I Eventual delivery: every update made to one non-faulty
replica is eventually processed by every non-faulty replica.
I Convergence: any two replicas that have processed the
same set of updates are in the same state
(even if updates were processed in a different order).
Properties:
I Does not require waiting for network communication
I Causal broadcast (or weaker) can disseminate updates
I Concurrent updates =⇒ conflicts need to be resolved
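As a concrete illustration of strong eventual consistency, a minimal sketch of a grow-only counter whose merge is an element-wise maximum (this particular data type is an assumed example, not something defined on the slide above): any two replicas that have seen the same set of updates converge to the same state, regardless of the order of updates and merges.

import java.util.Arrays;

public class GCounterSketch {
    private final long[] counts;   // counts[i] = number of increments performed at node i
    private final int myIndex;

    public GCounterSketch(int numNodes, int myIndex) {
        this.counts = new long[numNodes];
        this.myIndex = myIndex;
    }

    public void increment() { counts[myIndex]++; }               // local update, no waiting

    public long value() { return Arrays.stream(counts).sum(); }  // counter value = sum of entries

    public void merge(GCounterSketch other) {                    // element-wise maximum
        for (int i = 0; i < counts.length; i++)
            counts[i] = Math.max(counts[i], other.counts[i]);
    }

    public static void main(String[] args) {
        GCounterSketch a = new GCounterSketch(2, 0), b = new GCounterSketch(2, 1);
        a.increment(); a.increment();  // two updates at replica A
        b.increment();                 // one update at replica B
        a.merge(b); b.merge(a);        // exchange states (in either order)
        System.out.println(a.value() + " " + b.value()); // 3 3, the replicas converge
    }
}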
Summary of minimum system model requirements
(ordered by decreasing strength of assumptions)

Problem                                                   Must wait for        Timing assumptions
consensus, total order broadcast, linearizable CAS        quorum               partially synchronous
linearizable get/set                                      quorum               asynchronous
eventual consistency, causal broadcast, FIFO broadcast    local replica only   asynchronous
Lecture 8
Families of algorithms:
I Conflict-free Replicated Data Types (CRDTs)
I Operation-based
I State-based
I Operational Transformation (OT)
Conflicts due to concurrent updates
[Diagram: nodes A and B, separated by a network partition, both start with the same calendar entry:]
{
  "title": "Lecture",
  "date": "2020-11-05",
  "time": "12:00"
}
[Concurrent updates during the partition change the title to "Lecture 1" and the time to "10:00"; once the updates are merged, both nodes end up with:]
{
  "title": "Lecture 1",
  "date": "2020-11-05",
  "time": "10:00"
}
Operation-based map CRDT
on initialisation do
    values := {}
end on
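A sketch of how the rest of such an operation-based map CRDT might apply delivered updates (assuming last-writer-wins per key; the Timestamped record and method names are made up). Because applying the same set of updates in any order yields the same map, the concurrent "title" and "time" updates from the calendar example both survive the merge:

import java.util.HashMap;
import java.util.Map;

public class OpBasedMapCrdtSketch {
    record Timestamped(long timestamp, String value) {}

    private final Map<String, Timestamped> values = new HashMap<>(); // values := {}

    // Called when an update operation (timestamp, key, value) is delivered by broadcast:
    // apply it unless we already have a newer value for that key.
    public void onDeliver(long timestamp, String key, String value) {
        Timestamped old = values.get(key);
        if (old == null || old.timestamp() < timestamp) {
            values.put(key, new Timestamped(timestamp, value));
        }
    }

    public static void main(String[] args) {
        OpBasedMapCrdtSketch a = new OpBasedMapCrdtSketch(), b = new OpBasedMapCrdtSketch();
        // Concurrent updates, delivered in different orders on the two replicas:
        a.onDeliver(1, "title", "Lecture 1"); a.onDeliver(2, "time", "10:00");
        b.onDeliver(2, "time", "10:00");      b.onDeliver(1, "title", "Lecture 1");
        System.out.println(a.values.equals(b.values)); // true, the replicas converge
    }
}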
Collaborative text editing: the problem
[Diagram: users A and B both start with the text "BC" (characters at indexes 0 and 1) and are separated by a network partition. User A performs insert(0, “A”), producing "ABC"; user B performs insert(2, “D”), producing "BCD". Applying A’s operation (insert, 0, “A”) at B gives "ABCD", but applying B’s operation (insert, 2, “D”) unchanged to A’s state "ABC" gives "ABDC": the users’ documents diverge.]
Operational transformation
[Diagram: user A performs insert(0, “A”) while user B concurrently performs insert(2, “D”). When B’s operation reaches A, it is transformed against A’s concurrent operation: T((insert, 2, “D”), (insert, 0, “A”)) = (insert, 3, “D”). After applying the transformed operations, both users’ documents read "ABCD".]
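A tiny sketch of what the transformation function T might look like for two concurrent insertions (an assumed simplification: real OT systems also define transformations for deletions and a consistent tie-breaking rule for equal indexes):

public class OtSketch {
    record Insert(int index, String text) {}

    // Transform op against a concurrent op that has already been applied locally:
    // shift op to the right if the applied insert landed at an earlier or equal index.
    // (A full implementation would break equal-index ties consistently, e.g. by node ID.)
    static Insert transform(Insert op, Insert applied) {
        if (applied.index() <= op.index()) {
            return new Insert(op.index() + applied.text().length(), op.text());
        }
        return op;
    }

    static String apply(String doc, Insert op) {
        return doc.substring(0, op.index()) + op.text() + doc.substring(op.index());
    }

    public static void main(String[] args) {
        String doc = "BC";
        Insert opA = new Insert(0, "A");   // user A: insert(0, "A")
        Insert opB = new Insert(2, "D");   // user B: insert(2, "D"), concurrent with opA

        // User A applies its own op, then the transformed version of B's op:
        String atA = apply(apply(doc, opA), transform(opB, opA));   // "ABCD"
        // User B applies its own op, then the transformed version of A's op:
        String atB = apply(apply(doc, opB), transform(opA, opB));   // "ABCD"
        System.out.println(atA + " " + atB);
    }
}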
Text editing CRDT
[Diagram: users A and B both start with the text "BC", where “B” is at position 0.5 and “C” at position 0.75; positions 0.0 and 1.0 are reserved for the start (⊢) and end (⊣) markers. User A performs insert(0.25, “A”); user B inserts “D” at position 0.875. After exchanging operations, both users have “A” at 0.25, “B” at 0.5, “C” at 0.75, and “D” at 0.875, i.e. the text "ABCD".]
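A sketch of the fractional-position idea (a toy model; a real text CRDT would also attach node IDs to positions to break ties between concurrent inserts and to avoid running out of numeric precision):

import java.util.TreeMap;

public class TextCrdtSketch {
    private final TreeMap<Double, Character> chars = new TreeMap<>(); // position -> character

    public TextCrdtSketch() {
        chars.put(0.5, 'B');    // starting document "BC", as in the example above
        chars.put(0.75, 'C');
    }

    // Insert between two neighbouring positions (0.0 and 1.0 are the start/end sentinels)
    // by picking the midpoint, and return the chosen position so it can be broadcast.
    public double insertBetween(double left, double right, char c) {
        double pos = (left + right) / 2.0;
        chars.put(pos, c);
        return pos;
    }

    // Apply an insert received from another replica (position chosen remotely).
    public void applyRemote(double pos, char c) { chars.put(pos, c); }

    public String text() {
        StringBuilder sb = new StringBuilder();
        for (char c : chars.values()) sb.append(c);   // TreeMap iterates in position order
        return sb.toString();
    }

    public static void main(String[] args) {
        TextCrdtSketch userA = new TextCrdtSketch(), userB = new TextCrdtSketch();
        double posA = userA.insertBetween(0.0, 0.5, 'A');   // A inserts at 0.25
        double posD = userB.insertBetween(0.75, 1.0, 'D');  // B inserts at 0.875
        userA.applyRemote(posD, 'D');                       // exchange operations
        userB.applyRemote(posA, 'A');
        System.out.println(userA.text() + " " + userB.text()); // ABCD ABCD
    }
}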
Operation-based text CRDT (1/2)

function ElementAt(chars, index)
    min = the unique triple (p, n, v) ∈ chars such that
        ∄(p′, n′, v′) ∈ chars. p′ < p ∨ (p′ = p ∧ n′ < n)
    if index = 0 then return min
    else return ElementAt(chars \ {min}, index − 1)
end function

on initialisation do
    chars := {(0, null, ⊢), (1, null, ⊣)}
end on
Google’s Spanner
A database system with millions of nodes, petabytes of data, distributed across datacenters worldwide.
Consistency properties:
I Serializable transaction isolation
I Linearizable reads and writes
I Many shards, each holding a subset of the data; atomic commit of transactions across shards
Obtaining commit timestamps
Must ensure that whenever T1 → T2 we have t1 < t2.
I Physical clocks may be inconsistent with causality
I Can we use Lamport clocks instead?
I Problem: linearizability depends on real-time order, and logical clocks may not reflect this!
[Diagram: transaction T1 executes on shard A and its results are returned to a user; based on them, the user takes an action that causes transaction T2 to run on shard B. Even though A and B never communicate directly, T2’s commit timestamp t2 must be greater than t1.]
TrueTime: explicit physical clock uncertainty
Spanner’s TrueTime clock returns an interval [t_earliest, t_latest].
The true physical timestamp must lie within that range.
On commit, wait for the uncertainty δi = t_i,latest − t_i,earliest.
[Diagram: transaction T1 on node A requests commit; TrueTime returns the interval [t_1,earliest, t_1,latest] of width δ1, and A waits for δ1 before reporting the commit as done. Transaction T2 on node B commits later in real time and likewise waits for its own uncertainty δ2. Because of the commit wait, T1’s uncertainty interval has entirely passed before T2 obtains its timestamp, so t1 < t2.]
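A rough sketch of the commit-wait rule described above. TrueTime itself is internal to Spanner, so the uncertainty interval is simulated here with an assumed fixed error bound around System.currentTimeMillis(); only the waiting logic is the point:

public class CommitWaitSketch {
    static final long ASSUMED_UNCERTAINTY_MS = 7;   // made-up clock error bound

    record Interval(long earliest, long latest) {}

    static Interval trueTimeNow() {
        long now = System.currentTimeMillis();
        return new Interval(now - ASSUMED_UNCERTAINTY_MS, now + ASSUMED_UNCERTAINTY_MS);
    }

    static long commit() throws InterruptedException {
        Interval t = trueTimeNow();
        long commitTimestamp = t.latest();
        // Commit wait: block until the clock's earliest bound has passed the chosen
        // timestamp, so any transaction that starts afterwards gets a larger timestamp.
        while (trueTimeNow().earliest() < commitTimestamp) {
            Thread.sleep(1);
        }
        return commitTimestamp;
    }

    public static void main(String[] args) throws InterruptedException {
        long t1 = commit();
        long t2 = commit();   // starts only after t1's commit wait has finished
        System.out.println("t1=" + t1 + " t2=" + t2 + " (t1 < t2: " + (t1 < t2) + ")");
    }
}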
Determining clock uncertainty in TrueTime
Clock servers with atomic clock or GPS receiver in each datacenter; servers report their clock uncertainty.
Each node syncs its quartz clock with a server every 30 sec.
[Plot: clock uncertainty (vertical axis) against time in seconds (0 to 90, horizontal axis). The baseline is the server’s reported uncertainty plus the round-trip time to the clock server; the uncertainty grows between the 30-second synchronisations due to quartz drift and drops back at each sync.]
Summary:
I Distributed systems are everywhere
I You use them every day: e.g. web apps
I Key goals: availability, scalability, performance
I Key problems: concurrency, faults, unbounded latency
I Key abstractions: replication, broadcast, consensus
I No one right way, just trade-offs