Distributed Systems: Chapter 07: Consistency & Replication

Distributed Systems
(3rd Edition)
Chapter 07: Consistency & Replication

Version: February 25, 2017
Consistency and replication: Introduction Reasons for replication
Performance and scalability
Main issue
To keep replicas consistent, we generally need to ensure that all conflicting
operations are done in the same order everywhere.
Conflicting operations: From the world of transactions

Read–write conflict: a read operation and a write operation act
concurrently
Write–write conflict: two concurrent write operations
Issue
Guaranteeing global ordering on conflicting operations may be a costly
operation, downgrading scalability.
Solution
Weaken consistency requirements so that, hopefully, global synchronization
can be avoided.
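A minimal sketch of why conflicting operations must be applied in the same order everywhere. It assumes two hypothetical update operations, add_one and double, on an integer data item: two replicas that apply the same write-write conflicting pair in different orders end up with different values.

```python
from typing import Callable, List


def apply_in_order(initial: int, ops: List[Callable[[int], int]]) -> int:
    """Apply a sequence of update operations to a replica's value."""
    value = initial
    for op in ops:
        value = op(value)
    return value


def add_one(v: int) -> int:  # hypothetical write operation 1
    return v + 1


def double(v: int) -> int:  # hypothetical write operation 2
    return v * 2


# Two replicas receive the same two conflicting writes, but in different orders.
replica_1 = apply_in_order(10, [add_one, double])  # (10 + 1) * 2
replica_2 = apply_in_order(10, [double, add_one])  # 10 * 2 + 1
```

The replicas diverge (22 versus 21), which is exactly what a global ordering of conflicting operations prevents.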
Consistency and replication: Data-centric consistency models
Data-centric consistency models
Consistency model
A contract between a (distributed) data store and processes, in which the data
store specifies precisely what the results of read and write operations are in
the presence of concurrency.
Essential
A data store is a distributed collection of storages:
[Figure: several processes, each working on its own local copy, together forming the distributed data store]
Consistency and replication: Data-centric consistency models Continuous consistency
Continuous Consistency
We can actually talk about a degree of consistency:

replicas may differ in their numerical value
replicas may differ in their relative staleness
there may be differences with respect to (number and order) of performed
update operations
Conit
Consistency unit ⇒ specifies the data unit over which consistency is to be
measured.
Example: Conit
Conit (contains the variables g, p, and d)

                   Replica A                        Replica B
Conit              g = 95   // gas                  g = 45   // gas
                   p = 78   // price                p = 70   // price
                   d = 558  // distance             d = 412  // distance

Operation/Result   <5, B>  g ← g + 45   [g = 45]    <5, B>  g ← g + 45   [g = 45]
                   <8, A>  g ← g + 50   [g = 95]    <6, B>  p ← p + 70   [p = 70]
                   <9, A>  p ← p + 78   [p = 78]    <7, B>  d ← d + 412  [d = 412]
                   <10, A> d ← d + 558  [d = 558]

Vector clock       A = (11, 5)                      B = (0, 8)
Order deviation    3                                1
Numerical dev.     (2, 482)                         (3, 686)

Each replica has a vector clock: ([known] time @ A, [known] time @ B).
B sends A operation [<5, B> : g ← g + 45]; A has made this operation
permanent (cannot be rolled back).
A has three pending operations ⇒ order deviation = 3.
A missed two operations from B, with a total difference of 70 + 412 = 482 units ⇒ (2, 482).
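The deviations in the conit example can be recomputed from the two operation logs. This sketch assumes operations are encoded as (timestamp, origin, variable, increment) tuples, and it looks at both logs at once, which a real replica cannot do: a replica only knows its own log and has to bound what it has not yet seen. Which operations are permanent is taken from the slide (only <5, B> at A); the function names are hypothetical.

```python
from typing import List, Set, Tuple

Op = Tuple[int, str, str, int]  # (timestamp, origin replica, variable, increment)

# Operation logs from the example (only the increments matter here).
log_a: List[Op] = [(5, "B", "g", 45), (8, "A", "g", 50), (9, "A", "p", 78), (10, "A", "d", 558)]
log_b: List[Op] = [(5, "B", "g", 45), (6, "B", "p", 70), (7, "B", "d", 412)]


def numerical_deviation(local: List[Op], remote: List[Op]) -> Tuple[int, int]:
    """(number of remote operations not yet seen locally, their total absolute weight)."""
    seen = {(ts, origin) for ts, origin, _, _ in local}
    missed = [op for op in remote if (op[0], op[1]) not in seen]
    return len(missed), sum(abs(inc) for _, _, _, inc in missed)


def order_deviation(local: List[Op], permanent: Set[Tuple[int, str]]) -> int:
    """Number of operations in the local log that are still tentative (not permanent)."""
    return sum(1 for ts, origin, _, _ in local if (ts, origin) not in permanent)
```

For replica A, `numerical_deviation(log_a, log_b)` yields (2, 482) and `order_deviation(log_a, {(5, "B")})` yields 3; for replica B, `numerical_deviation(log_b, log_a)` yields (3, 686), matching the table.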

Consistency and replication: Data-centric consistency models Consistent ordering of operations
Sequential consistency
Definition
The result of any execution is the same as if the operations of all processes
were executed in some sequential order, and the operations of each individual
process appear in this sequence in the order specified by its program.
(a) A sequentially consistent data store. (b) A data store that is not sequentially
consistent.
(a)
P1: W(x)a
P2: W(x)b
P3: R(x)b R(x)a
P4: R(x)b R(x)a
(b)
P1: W(x)a
P2: W(x)b
P3: R(x)b R(x)a
P4: R(x)a R(x)b
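A brute-force checker makes the definition concrete: it searches for one global sequence that keeps each process's program order and in which every read returns the most recent write. The operation encoding and the function name are assumptions of this sketch, and the search is exponential, so it only suits tiny histories such as the two figures.

```python
from typing import Dict, List, Optional, Tuple

Op = Tuple[str, str, str]  # (kind, item, value): ("W", "x", "a") is W(x)a, ("R", "x", "b") is R(x)b


def is_sequentially_consistent(histories: List[List[Op]]) -> bool:
    """Exhaustively look for one interleaving that respects every process's
    program order and in which each read returns the latest write."""

    def search(pos: Tuple[int, ...], store: Dict[str, Optional[str]]) -> bool:
        if all(p == len(h) for p, h in zip(pos, histories)):
            return True  # every operation has been placed in the sequence
        for i, h in enumerate(histories):
            if pos[i] == len(h):
                continue  # process i has no operations left
            kind, item, value = h[pos[i]]
            nxt = pos[:i] + (pos[i] + 1,) + pos[i + 1:]
            if kind == "W" and search(nxt, {**store, item: value}):
                return True
            if kind == "R" and store.get(item) == value and search(nxt, store):
                return True
        return False

    return search(tuple(0 for _ in histories), {})


fig_a = [
    [("W", "x", "a")],
    [("W", "x", "b")],
    [("R", "x", "b"), ("R", "x", "a")],
    [("R", "x", "b"), ("R", "x", "a")],
]
fig_b = fig_a[:3] + [[("R", "x", "a"), ("R", "x", "b")]]
```

Figure (a) is accepted (W(x)b, both R(x)b, W(x)a, both R(x)a is one valid sequence), whereas figure (b) is rejected because P3 needs b before a and P4 needs a before b.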
Causal consistency
Definition
Writes that are potentially causally related must be seen by all processes in the
same order. Concurrent writes may be seen in a different order by different
processes.
(a) A violation of a causally-consistent store. (b) A correct sequence of events
in a causally-consistent store.
(a)
P1: W(x)a
P2: R(x)a W(x)b
P3: R(x)b R(x)a
P4: R(x)a R(x)b
(b)
P1: W(x)a
P2: W(x)b
P3: R(x)b R(x)a
P4: R(x)a R(x)b
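One common way to decide whether two writes are potentially causally related is to compare vector timestamps. The sketch assumes clocks with one entry per process (P1 to P4) that count writes only, and that a read merges the clock of the write it returns into the reader's clock; the function names are hypothetical. Under these assumptions, W(x)b in figure (a) has W(x)a in its causal history, while in figure (b) the two writes are concurrent.

```python
from typing import Sequence


def happened_before(u: Sequence[int], v: Sequence[int]) -> bool:
    """u -> v iff u <= v in every component and u != v."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def same_order_required(u: Sequence[int], v: Sequence[int]) -> bool:
    """Causally related writes must be seen in the same order by all processes;
    concurrent writes (neither happened before the other) may be seen in any order."""
    return happened_before(u, v) or happened_before(v, u)


# Assumed clocks (P1, P2, P3, P4) counting writes only.
w_a = (1, 0, 0, 0)        # P1: W(x)a
w_b_fig_a = (1, 1, 0, 0)  # figure (a): P2 read a before writing b
w_b_fig_b = (0, 1, 0, 0)  # figure (b): P2 wrote b without reading a
```

In figure (a), `same_order_required(w_a, w_b_fig_a)` is true, so P3 and P4 seeing a and b in different orders violates causal consistency; in figure (b) it is false, so the different orders are allowed.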
Grouping operations
Definition
Accesses to locks are sequentially consistent.
No access to a lock is allowed to be performed until all previous writes
have completed everywhere.
No data access is allowed to be performed until all previous accesses to
locks have been performed.
Basic idea
You don’t care that reads and writes of a series of operations are immediately
known to other processes. You just want the effect of the series itself to be
known.
A valid event sequence for entry consistency
P1: L(x) W(x)a L(y) W(y)b U(x) U(y)
P2: L(x) R(x)a R(y)NIL
P3: L(y) R(y)b
Observation
Entry consistency implies that we need to lock and unlock data (implicitly or
not).
Question
What would be a convenient way of making this consistency more or less
transparent to programmers?
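The valid event sequence can be reproduced with a toy model in which every data item is guarded by its own lock, acquiring a lock pulls in the latest released value of that item only, and releasing it publishes the item. The class and method names are hypothetical, and blocking, waiting, and networking are not modelled.

```python
from typing import Dict, Optional

Local = Dict[str, Optional[str]]  # a process's local copies of data items


class EntryConsistencyStore:
    """Toy model of entry consistency (hypothetical API, no waiting on locks)."""

    def __init__(self) -> None:
        self.released: Local = {}          # value published at the last unlock of each item
        self.holder: Dict[str, str] = {}   # item -> process currently holding its lock

    def lock(self, proc: str, local: Local, item: str) -> None:
        if item in self.holder:
            raise RuntimeError(f"{item} is locked by {self.holder[item]}")
        self.holder[item] = proc
        local[item] = self.released.get(item)  # acquire refreshes only the guarded item

    def unlock(self, proc: str, local: Local, item: str) -> None:
        if self.holder.get(item) != proc:
            raise RuntimeError(f"{proc} does not hold the lock on {item}")
        self.released[item] = local.get(item)  # release publishes the guarded item
        del self.holder[item]


store = EntryConsistencyStore()
p1: Local = {}
p2: Local = {}
p3: Local = {}
store.lock("P1", p1, "x")
p1["x"] = "a"                 # W(x)a
store.lock("P1", p1, "y")
p1["y"] = "b"                 # W(y)b
store.unlock("P1", p1, "x")
store.unlock("P1", p1, "y")
store.lock("P2", p2, "x")     # P2 now reads x = a, but y (never locked) is still unknown
store.lock("P3", p3, "y")     # P3 reads y = b
```

The model answers the closing question in one possible way: if the lock calls were generated implicitly around accesses to shared items, the consistency model would become largely transparent to programmers.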
Consistency and replication: Client-centric consistency models
Consistency for mobile users
Example
Consider a distributed database to which you have access through your
notebook. Assume your notebook acts as a front end to the database.
At location A you access the database doing reads and updates.
At location B you continue your work, but unless you access the same
server as the one at location A, you may detect inconsistencies:
your updates at A may not have yet been propagated to B
you may be reading newer entries than the ones available at A
your updates at B may eventually conflict with those at A
Note
The only thing you really want is that the entries you updated and/or read at A
are in B the way you left them in A. In that case, the database will appear to be
consistent to you.
Consistency and replication: Client-centric consistency models
Basic architecture
The principle of a mobile user accessing different replicas of a distributed
database
[Figure: a client on a portable computer performs read and write operations on a distributed and replicated database across a wide-area network; when the client moves to another location, it (transparently) connects to another replica, so replicas need to maintain client-centric consistency]
Consistency and replication: Client-centric consistency models Monotonic reads
Monotonic reads
Definition
If a process reads the value of a data item x, any successive read operation on
x by that process will always return that same or a more recent value.
The read operations performed by a single process P at two different local
copies of the same data store. (a) A monotonic-read consistent data store.
(b) A data store that does not provide monotonic reads.
(a)
L1: W1(x1) R1(x1)
L2: W2(x1;x2) R1(x2)
(b)
L1: W1(x1) R1(x1)
L2: W2(x1|x2) R1(x2)
Client-centric consistency: notation
Notation
W1(x2) is the write operation by process P1 that leads to version x2 of x.
W1(xi; xj) indicates P1 produces version xj based on a previous version xi.
W1(xi|xj) indicates P1 produces version xj concurrently to version xi.
Monotonic reads
Example
Automatically reading your personal calendar updates from different servers.
Monotonic Reads guarantees that the user sees all updates, no matter from
which server the automatic reading takes place.
Example
Reading (not modifying) incoming mail while you are on the move. Each time
you connect to a different e-mail server, that server fetches (at least) all the
updates from the server you previously visited.
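The mail example can be sketched as follows, assuming a mailbox is just a set of message IDs (so the order in which messages arrive does not matter) and using hypothetical class names. The client remembers which messages it has already seen; a newly visited server that lacks some of them first fetches everything the previously visited server had.

```python
from typing import Optional, Set


class MailReplica:
    """Hypothetical mail server holding a set of message IDs."""

    def __init__(self, messages: Set[str]) -> None:
        self.messages = set(messages)


class MobileReader:
    """Hypothetical client-side bookkeeping for monotonic reads of a mailbox."""

    def __init__(self) -> None:
        self.read_set: Set[str] = set()             # every message this client has seen so far
        self.previous: Optional[MailReplica] = None

    def read(self, replica: MailReplica) -> Set[str]:
        if self.previous is not None and not self.read_set <= replica.messages:
            # The new server first pulls in the updates of the previously visited server.
            replica.messages |= self.previous.messages
        self.read_set |= replica.messages
        self.previous = replica
        return set(replica.messages)


s1 = MailReplica({"m1", "m2"})
s2 = MailReplica({"m1"})  # lagging replica that has not yet received m2
reader = MobileReader()
first = reader.read(s1)
second = reader.read(s2)  # s2 fetches m2 before answering
```

Because replicas only ever gain messages here, every later read returns a superset of what the client saw before, which is the monotonic-read guarantee for this simple data type.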
Consistency and replication: Client-centric consistency models Monotonic writes
Monotonic writes
Definition
A write operation by a process on a data item x is completed before any
successive write operation on x by the same process.
(a) A monotonic-write consistent data store. (b) A data store that does not
provide monotonic-write consistency. (c) Again, no monotonic-write consistency, as
WS(x1|x2) and thus also WS(x1|x3). (d) Consistent, as WS(x1;x3), although x1 has
apparently overwritten x2.
(a)
L1: W1(x1)
L2: W2(x1;x2) W1(x2;x3)
(b)
L1: W1(x1)
L2: W2(x1|x2) W1(x1|x3)
(c)
L1: W1(x1)
L2: W2(x1|x2) W1(x2;x3)
(d)
L1: W1(x1)
L2: W2(x1|x2) W1(x1;x3)
Monotonic writes
Example
Updating a program at server S2 and ensuring that all components on which
compilation and linking depend are also placed at S2.
Example
Maintaining versions of replicated files in the correct order everywhere
(propagate the previous version to the server where the newest version is
installed).
Consistency and replication: Client-centric consistency models Read your writes
Read your writes
Definition
The effect of a write operation by a process on data item x will always be seen
by a successive read operation on x by the same process.
(a) A data store that provides read-your-writes consistency. (b) A data store
that does not.
(a)
L1: W1(x1)
L2: W2(x1;x2) R1(x2)
(b)
L1: W1(x1)
L2: W2(x1|x2) R1(x2)
Example
Updating your Web page and guaranteeing that your Web browser shows the
newest version instead of its cached copy.
Consistency and replication: Client-centric consistency models Writes follow reads
Writes follow reads
Definition
A write operation by a process on a data item x following a previous read
operation on x by the same process is guaranteed to take place on the same
or a more recent value of x than the one that was read.
(a) A writes-follow-reads consistent data store. (b) A data store that does not
provide writes-follow-reads consistency.
(a)
L1: W1(x1) R2(x1)
L2: W3(x1;x2) W2(x2;x3)
(b)
L1: W1(x1) R2(x1)
L2: W3(x1|x2) W2(x1|x3)
Example
See reactions to posted articles only if you have the original posting (a read
“pulls in” the corresponding write operation).
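The four client-centric models can be enforced in a uniform way by keeping, per client session, a read set (IDs of the writes relevant to what the client has read) and a write set (IDs of the client's own writes), and accepting a server only if it has already applied the required writes. The mapping below follows the session-guarantee design of the Bayou system; the names are hypothetical, and this presence check ignores the additional write-ordering requirement that monotonic writes and writes-follow-reads impose at the server.

```python
from typing import Dict, Iterable, Set, Tuple

# guarantee -> (operation it constrains, session set the chosen server must already contain)
REQUIRED: Dict[str, Tuple[str, str]] = {
    "read-your-writes": ("read", "write_set"),
    "monotonic-reads": ("read", "read_set"),
    "monotonic-writes": ("write", "write_set"),
    "writes-follow-reads": ("write", "read_set"),
}


def server_is_acceptable(
    guarantees: Iterable[str],
    op: str,
    server_writes: Set[str],
    session: Dict[str, Set[str]],
) -> bool:
    """True if a server that has applied server_writes may serve op for this session."""
    for g in guarantees:
        constrained_op, set_name = REQUIRED[g]
        if op == constrained_op and not session[set_name] <= server_writes:
            return False
    return True


session = {"read_set": {"w1"}, "write_set": {"w3"}}
server_writes = {"w1", "w2"}  # this replica has not yet seen the client's own write w3
```

With this replica, a read under monotonic reads is acceptable, but a read under read-your-writes is not, because the client's own write w3 is missing; the client would have to pick another replica or wait until w3 has been propagated.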
Consistency and replication: Replica management Finding the best server location
Replica placement
Essence
Figure out what the best K places are out of N possible locations.
Select the best location out of N − K for which the average distance to clients
is minimal. Then choose the next best server. (Note: the first chosen
location minimizes the average distance to all clients.) Computationally
expensive.
Select the K -th largest autonomous system and place a server at the
best-connected host. Computationally expensive.
Position nodes in a d-dimensional geometric space, where distance
reflects latency. Identify the K regions with highest density and place a
server in every one. Computationally cheap.
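The first heuristic (greedy selection) can be sketched as follows, assuming a precomputed distance matrix dist[c][s] from client c to candidate site s and k no larger than the number of sites; the function name is hypothetical. Each round adds the candidate that minimizes the average distance from every client to its nearest chosen site, which requires re-evaluating all remaining candidates against all clients and is why the approach is computationally expensive.

```python
from typing import List


def greedy_placement(dist: List[List[float]], k: int) -> List[int]:
    """Pick k candidate sites greedily; dist[c][s] is the distance from client c to site s."""
    n_sites = len(dist[0])
    chosen: List[int] = []
    for _ in range(k):
        best_site, best_cost = -1, float("inf")
        for s in range(n_sites):
            if s in chosen:
                continue
            trial = chosen + [s]
            # Average distance from every client to its nearest server in the trial set.
            cost = sum(min(row[t] for t in trial) for row in dist) / len(dist)
            if cost < best_cost:
                best_site, best_cost = s, cost
        chosen.append(best_site)
    return chosen


# Two groups of clients: clients 0-1 are close to site 0, clients 2-3 to site 1;
# site 2 is at a moderate distance from everyone.
dist = [[1, 10, 5], [2, 9, 5], [10, 1, 5], [9, 2, 5]]
```

Here `greedy_placement(dist, 1)` picks the central site 2, which minimizes the average distance to all clients, and `greedy_placement(dist, 2)` then adds site 0 to serve one of the client groups.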
Consistency and replication: Replica management Content replication and placement
Content replication
Distinguish different processes

A process is capable of hosting a replica of an object or data:
Permanent replicas: Process/machine always having a replica
Server-initiated replica: Process that can dynamically host a replica on
request of another server in the data store
Client-initiated replica: Process that can dynamically host a replica on
request of a client (client cache)
Content replication
The logical organization of different kinds of copies of a data store into three
concentric rings
[Figure: three concentric rings holding, from the inside out, permanent replicas, server-initiated replicas, and client-initiated replicas, with clients outside; further labels: server-initiated replication, client-initiated replication]
Server-initiated replicas
Counting access requests from different clients
[Figure: clients C1 and C2 request file F from server Q, which holds a copy of F; server P, which has no copy of F, is closest to C1 and C2]
Server Q counts access from C1 and C2 as if they would come from P.
Keep track of access counts per file, aggregated by considering the server
closest to the requesting clients.
Number of accesses drops below threshold D ⇒ drop file
Number of accesses exceeds threshold R ⇒ replicate file
Number of accesses between D and R ⇒ migrate file
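The counting and threshold rules can be sketched as two small functions. The names are hypothetical, D < R is assumed, and a count exactly equal to D or R is treated as lying between the thresholds (the slide does not specify the boundaries).

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_by_closest(requests: Iterable[str], closest: Dict[str, str]) -> Counter:
    """Count accesses to file F per server that is closest to each requesting client."""
    return Counter(closest[client] for client in requests)


def decide(count: int, d: int, r: int) -> str:
    """Threshold rule for file F as seen from one (closest) server, assuming d < r."""
    if not d < r:
        raise ValueError("expected D < R")
    if count < d:
        return "drop"
    if count > r:
        return "replicate"
    return "migrate"  # d <= count <= r


# C1 and C2 are closest to P; C3 is closest to Q (the server holding F).
closest = {"C1": "P", "C2": "P", "C3": "Q"}
counts = aggregate_by_closest(["C1", "C2", "C1", "C3"], closest)
```

With D = 2 and R = 5, the three requests attributed to P fall between the thresholds, so F would be migrated towards P rather than replicated there.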
Consistency and replication: Replica management Content distribution
Content distribution
Consider only a client-server combination

Propagate only notification/invalidation of update (often used for caches)
Transfer data from one copy to another (distributed databases): passive
replication
Propagate the update operation to other copies: active replication
Note
No single approach is best; the choice depends highly on the available bandwidth and
the read-to-write ratio at replicas.

Content distribution: client/server system
A comparison between push-based and pull-based protocols in the case of
multiple-client, single-server systems
Pushing updates: server-initiated approach, in which an update is
propagated regardless of whether the target asked for it.
Pulling updates: client-initiated approach, in which the client requests to be
updated.

Issue                           Push-based                            Pull-based
1: State at server              List of client caches                 None
2: Messages to be exchanged     Update (and possibly fetch update)    Poll and update
3: Response time at the client  Immediate (or fetch-update time)      Fetch-update time

Observation
We can dynamically switch between pulling and pushing using leases: a lease is a
contract in which the server promises to push updates to the client until the
lease expires.
Make lease expiration time dependent on system’s behavior (adaptive leases)

Age-based leases: An object that hasn’t changed for a long time is unlikely to
change in the near future, so provide a long-lasting lease
Renewal-frequency based leases: The more often a client requests a
specific object, the longer the expiration time for that client (for that object)
will be
State-based leases: The more loaded a server is, the shorter the
expiration times become
Question
Why are we doing all this?
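The three adaptive-lease policies can be sketched as simple functions. Only the direction of each dependency comes from the slide; the scaling factors and caps (factor, seconds_per_request, cap) and all function names are hypothetical tuning choices for illustration. The last helper shows the switch from push to pull: while the lease is valid the server pushes, afterwards the client has to poll.

```python
def age_based_lease(seconds_since_change: float, factor: float = 0.5, cap: float = 3600.0) -> float:
    """Objects unchanged for a long time get long leases (capped)."""
    return min(cap, factor * seconds_since_change)


def renewal_based_lease(requests_per_hour: float, seconds_per_request: float = 60.0,
                        cap: float = 3600.0) -> float:
    """Clients that request an object often get a longer lease for that object."""
    return min(cap, seconds_per_request * requests_per_hour)


def state_based_lease(base: float, server_load: float) -> float:
    """server_load in [0, 1]: the busier the server, the shorter the lease."""
    if not 0.0 <= server_load <= 1.0:
        raise ValueError("server_load must lie in [0, 1]")
    return base * (1.0 - server_load)


def server_pushes(now: float, granted_at: float, duration: float) -> bool:
    """True while the lease is valid (server pushes); afterwards the client must pull."""
    return now < granted_at + duration
```

One motivation visible in the sketch: short leases limit how much push state a loaded server must keep, while long leases spare clients from polling objects that rarely change.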

Consistency and replication: Consistency protocols Continuous consistency
Continuous consistency: Numerical errors
Principal operation
Every server $S_i$ has a log, denoted as $L_i$.
Consider a data item x and let val(W) denote the numerical change in its
value after a write operation W. Assume that
$$\forall W: \mathrm{val}(W) > 0$$
W is initially forwarded to one of the N replicas, denoted as origin(W).
TW[i, j] is the total value of the writes executed by server $S_i$ that originated from $S_j$:
$$TW[i,j] = \sum \{\, \mathrm{val}(W) \mid \mathrm{origin}(W) = S_j \wedge W \in L_i \,\}$$

Note
Actual value v(t) of x:
$$v(t) = v_{\text{init}} + \sum_{k=1}^{N} TW[k,k]$$
Value $v_i$ of x at server $S_i$:
$$v_i = v_{\text{init}} + \sum_{k=1}^{N} TW[i,k]$$

Problem
We need to ensure that $v(t) - v_i < \delta_i$ for every server $S_i$.

Approach
Let every server $S_k$ maintain a view $TW_k[i,j]$ of what it believes is the value of
$TW[i,j]$. This information can be gossiped when an update is propagated.

Note
$$0 \le TW_k[i,j] \le TW[i,j] \le TW[j,j]$$

Solution
$S_k$ sends operations from its log to $S_i$ when it sees that $TW_k[i,k]$ is getting too
far from $TW[k,k]$, in particular, when
$$TW[k,k] - TW_k[i,k] > \delta_i/(N-1)$$

Question
To what extent are we being pessimistic here: where does $\delta_i/(N-1)$ come
from?

Note
Staleness can be bounded analogously, essentially by keeping track of what has
been seen last from $S_i$ (see book).
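A toy simulation of the push rule for a single data item, with hypothetical class and method names: tw[i][j] plays the role of TW[i, j] and view[k][i] the role of TW_k[i, k]. For simplicity, a server pushes its entire own contribution when the rule fires, whereas a real server would send only the missing log entries. Since the k = i term of v(t) − v_i is zero and each of the other N − 1 terms stays at most δ_i/(N − 1), the deviation at S_i never exceeds δ_i.

```python
from typing import List


class NumericalBoundSim:
    """Toy model: N servers apply positive increments to one data item x.

    tw[i][j]   ~ TW[i, j]:   total value of writes from S_j applied at S_i.
    view[k][i] ~ TW_k[i, k]: what S_k believes S_i has applied from S_k.
    """

    def __init__(self, n: int, delta: List[float]) -> None:
        self.n = n
        self.delta = delta  # delta[i] is the allowed deviation at server S_i
        self.tw = [[0.0] * n for _ in range(n)]
        self.view = [[0.0] * n for _ in range(n)]

    def local_write(self, k: int, value: float) -> None:
        if value <= 0:
            raise ValueError("the protocol assumes val(W) > 0")
        self.tw[k][k] += value  # S_k executes a write that originated at S_k
        for i in range(self.n):
            if i != k and self.tw[k][k] - self.view[k][i] > self.delta[i] / (self.n - 1):
                self.push(k, i)

    def push(self, k: int, i: int) -> None:
        # Send S_k's own writes to S_i (here: the whole contribution at once).
        self.tw[i][k] = self.tw[k][k]
        self.view[k][i] = self.tw[k][k]

    def deviation(self, i: int) -> float:
        actual = sum(self.tw[k][k] for k in range(self.n))  # v(t) - v_init
        local = sum(self.tw[i][k] for k in range(self.n))   # v_i - v_init
        return actual - local


sim = NumericalBoundSim(3, [10.0, 10.0, 10.0])
worst = 0.0
for origin, value in [(0, 3), (1, 4), (1, 2), (2, 5), (1, 1), (0, 6), (2, 2), (2, 4), (1, 3)]:
    sim.local_write(origin, value)
    worst = max(worst, max(sim.deviation(i) for i in range(sim.n)))
```

With N = 3 and δ_i = 10, each server tolerates at most 5 units of unseen value per peer; in this run the largest deviation observed at any server is 8, within the bound of 10.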

Consistency and replication: Consistency protocols Primary-based protocols
Primary-based protocols
Primary-backup protocol
[Figure: clients, the primary server for item x, and a backup server of the data store, annotated with the following steps]
W1. Write request
W2. Forward request to primary
W3. Tell backups to update
W4. Acknowledge update
W5. Acknowledge write completed
R1. Read request
R2. Response to read
Example primary-backup protocol

Traditionally applied in distributed databases and file systems that require a
high degree of fault tolerance. Replicas are often placed on the same LAN.
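A minimal in-process sketch of the blocking primary-backup (remote-write) protocol, with hypothetical class names and no failures or networking: a write arriving at a backup is forwarded to the primary (W2), which updates every backup (W3), collects their acknowledgements (W4), and only then acknowledges the write (W5); reads are served from the local copy (R1, R2).

```python
from typing import Dict, List, Optional


class Replica:
    """Hypothetical backup server holding a local copy of the data store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}
        self.primary: Optional["Primary"] = None

    def apply_update(self, key: str, value: str) -> bool:
        self.store[key] = value
        return True  # W4: acknowledge update

    def read(self, key: str) -> Optional[str]:
        return self.store.get(key)  # R1/R2: served from the local copy

    def write(self, key: str, value: str) -> bool:
        assert self.primary is not None
        return self.primary.write(key, value)  # W2: forward request to primary


class Primary(Replica):
    """Hypothetical primary server for the data items."""

    def __init__(self, name: str, backups: List[Replica]) -> None:
        super().__init__(name)
        self.backups = backups
        self.primary = self
        for b in backups:
            b.primary = self

    def write(self, key: str, value: str) -> bool:
        self.store[key] = value
        acks = [b.apply_update(key, value) for b in self.backups]  # W3, then W4
        return all(acks)  # W5: acknowledge only after every backup has acknowledged


b1, b2 = Replica("b1"), Replica("b2")
primary = Primary("p", [b1, b2])
ok = b1.write("x", "a")  # W1 at a backup, forwarded to the primary
```

Because the write is acknowledged only after all backups have applied it, any later read at any replica returns the new value, at the price of blocking the client for the whole update round.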
Primary-backup protocol with local writes
[Figure: clients, the old primary and the new primary for item x, and a backup server of the data store]
W2. Move item x to new primary
R2. Response to read
Example primary-backup protocol with local writes

Mobile computing in disconnected mode (ship all relevant files to user before
disconnecting, and update later on).
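A toy model of a single item x under the local-write variant, with hypothetical names: the primary role moves to whichever server is writing (item x is moved to the new primary), the write is then performed locally, and backups are only brought up to date later, for example when a disconnected mobile user reconnects.

```python
from typing import Dict, List


class LocalWriteItem:
    """Hypothetical model of one item x under a local-write primary-backup protocol."""

    def __init__(self, servers: List[str], primary: str, value: str) -> None:
        self.copies: Dict[str, str] = {s: value for s in servers}
        self.primary = primary
        self.unpropagated = False  # True while backups lag behind the primary

    def write(self, server: str, value: str) -> None:
        if server != self.primary:
            # Move item x to the new primary: it first takes over the latest value.
            self.copies[server] = self.copies[self.primary]
            self.primary = server
        self.copies[server] = value  # the write itself is purely local
        self.unpropagated = True     # the client can be acknowledged right away

    def propagate(self) -> None:
        """Later (e.g. on reconnection), push the primary's value to all backups."""
        latest = self.copies[self.primary]
        for s in self.copies:
            self.copies[s] = latest
        self.unpropagated = False


item = LocalWriteItem(["s1", "s2", "s3"], primary="s1", value="v0")
item.write("s2", "v1")  # s2 becomes the primary; s1 and s3 still hold v0
```

Successive writes at the new primary stay local and fast; the trade-off is that other servers read stale values until `propagate` runs.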
Consistency and replication: Consistency protocols Replicated-write protocols
Replicated-write protocols
Quorum-based protocols
Ensure that each operation is carried out in such a way that a majority vote is
established: distinguish a read quorum and a write quorum.
Three examples of the voting algorithm. (a) A correct choice of read and write
set. (b) A choice that may lead to write-write conflicts. (c) A correct choice,
known as ROWA (read one, write all)
[Figure: twelve replicas, A to L, with the read quorum and the write quorum marked in each panel. (a) N_R = 3, N_W = 10; (b) N_R = 7, N_W = 6; (c) N_R = 1, N_W = 12]
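The slide does not spell out the constraints behind panels (a) to (c). In Gifford's voting scheme, which the figure illustrates, a configuration with N replicas must satisfy N_R + N_W > N (every read quorum overlaps every write quorum, preventing read-write conflicts) and N_W > N/2 (any two write quorums overlap, preventing write-write conflicts). A small check, with a hypothetical function name:

```python
def quorum_is_valid(n: int, n_r: int, n_w: int) -> bool:
    """Gifford's voting constraints for n replicas, read quorum n_r, and write quorum n_w."""
    reads_see_latest_write = n_r + n_w > n  # read and write quorums always intersect
    writes_are_ordered = 2 * n_w > n        # two write quorums always intersect
    return reads_see_latest_write and writes_are_ordered
```

For N = 12, panel (a) with N_R = 3, N_W = 10 and panel (c) with N_R = 1, N_W = 12 (ROWA) pass both checks, whereas panel (b) with N_W = 6 fails the second one, since two disjoint write quorums of six replicas can exist.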
