Week-6 Lecture Notes
Geo-distributed Data Centers
Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Preface
Content of this Lecture:
• In this lecture, we will study geo-distributed cloud data centers and the
interaction of data centers with users and with other data centers.
• We will also describe data center interconnection techniques: (i) traditional
schemes such as MPLS, (ii) cutting-edge designs such as Google's B4, and
(iii) Microsoft's Swan.
Inter-Data Center Networking: The Problem
• Today, using virtually any popular web application means communicating with a
server in a data center.
• However, connectivity to that service depends on the Internet, so the Internet
becomes crucial to the application service's performance.
• Data centers also communicate with each other over the Internet.
Example: replicating client data across multiple data centers.
• In the cloud scenario, wide-area connectivity, i.e. the Internet, is as
crucial as the data center infrastructure.
Figure: The Internet.
Why Multiple Data centers ?
• Why does a provider like Google need such an extensive infrastructure with so
many locations across a wide expanse of the globe?
• Better data availability: If one of the facilities goes down due to a natural
disaster, the data can still be available at another location if it is
replicated.
• Load balancing: Multiple facilities can spread incoming and outgoing traffic
over the Internet across a wider set of providers and wider geographic regions.
• Latency: If present in multiple parts of the globe, the provider can reach
clients in different locations over smaller distances, thus reducing latency.
• Local data laws: Several authorities might require that companies store data
from that country in that jurisdiction itself.
• Hybrid public-private operation: Handle the average demand for the service
from the private infrastructure, and then offload peak demand to the public
cloud.
Significant Inter-data center traffic
• Study from five Yahoo! data centers from 2011.
• The study is based on anonymized traces from border routers that connect
to these data centers.
Fig.: Overview of five major Yahoo! data centers
and their network connectivity.
Significant Inter-data center traffic
Two plots show the number of flows: on the left, between clients and the data
centers; on the right, between the data centers themselves. The three lines on
each plot are for three different data center locations. Note that the y-axes
of the two plots are different. In terms of number of flows, the traffic between
data centers is 10% to 20% of the traffic from data centers to clients.
Flows between data centers can be very long lived and carry more bytes than
those between clients and data centers.
Why are these networks different ?
Persistent, dedicated, 100s-of-Gbps connectivity between a (small) set of
end-points.
Figure: Internet WAN (end-host to end-host, end-host to app) vs. private
inter-data-center WAN — how much design flexibility does each offer?
Microsoft: "expensive resource, with amortized annual cost of 100s of millions
of dollars"
[Achieving High Utilization with Software-Driven WAN, Hong et al., ACM SIGCOMM'13]
(i) MPLS: Traditional WAN approach
• The traditional approach to traffic engineering in such networks is to use
MPLS (Multiprotocol Label Switching).
• Consider a network with several different sites spread over a wide geographic
area, connected to each other, perhaps over long-distance fiber links.
1. Link-state protocol (OSPF / IS-IS)
• Use link-state protocol (OSPF or IS-IS) to flood information about the
network's topology to all nodes.
• So at the end of such a protocol, every node has a map of the network.
2. Flood available bandwidth information
• For traffic engineering, the protocol also spreads information about the
bandwidth usage on the different links in the network.
• Given that there is already traffic flowing in this network, some links will
have spare capacity and some won't.
• Both IS-IS and OSPF have extensions that allow flooding of available-bandwidth
information together with their protocol messages.
3. Fulfill tunnel provisioning requests
• Knowing the state of the network, when a router receives a new flow setup
request, it sets up a tunnel along the shortest path on which enough capacity
is available. It sends protocol messages to the routers on the path to set up
this tunnel (see the sketch below).
• Further, MPLS also supports the notion of priorities: if a higher-priority
flow arrives with a request for a path, lower-priority flows might be
displaced. These flows might then use higher-latency or higher-cost paths
through the network.
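To make the tunnel-provisioning step concrete, here is a minimal sketch (in Python, not actual MPLS/RSVP-TE code) of constraint-based path selection: prune links without enough spare capacity, run a shortest-path search on what remains, then subtract the reserved bandwidth. The data-structure format and all names are illustrative assumptions.

```python
import heapq

def provision_tunnel(links, demand, src, dst):
    """Pick the shortest path that still has `demand` units of spare capacity.

    links: dict {(u, v): {"cost": c, "spare": s}}  (directed, illustrative format)
    Returns the path as a list of nodes, or None if no feasible path exists.
    """
    # Constraint-based routing: ignore links that cannot carry the new tunnel.
    graph = {}
    for (u, v), attrs in links.items():
        if attrs["spare"] >= demand:
            graph.setdefault(u, []).append((v, attrs["cost"]))

    # Plain Dijkstra on the pruned graph.
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))

    if dst not in dist:
        return None
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    path.reverse()
    # In real MPLS-TE the ingress would now signal the tunnel (e.g. via RSVP-TE)
    # and the routers on the path would re-flood the reduced spare capacity.
    for u, v in zip(path, path[1:]):
        links[(u, v)]["spare"] -= demand
    return path
```

The pruned graph corresponds to step 2 (flooded bandwidth information), and the capacity update corresponds to step 4 (updating and re-flooding network state).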
4. Update network state, flood information
• After a flow is assigned a tunnel, the routers update the network state and
flood the new available-bandwidth information.
4. Update network state, flood information
• When a data packet comes into the ingress router, the router looks at the packet's header and
decides what label, that is what tunnel this packet belongs to. Then it encapsulates this packet
with that tunnel's label and sends it along the tunnel.
• The egress router then decapsulates the packet, looks at the packet header again and sends it
to the destination. In this scheme, only the ingress and egress routers read the packet
header. Every other router on the path just looks at the assigned label.
Simple forwarding along the path
• This makes forwarding along the path very simple; this "label switching" is
what gives MPLS its name.
• Also, MPLS can run over several different protocols: as long as the ingress
and egress routers understand a protocol and can map it onto labels, it works.
Hence "multi-protocol" label switching (a toy illustration follows).
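A toy illustration of label switching: only the ingress classifies on the packet header; transit routers forward on the label alone. All table entries and router names below are invented for the example.

```python
# Toy label-switched forwarding. Only the ingress inspects the IP header;
# every transit router forwards on the label alone (table entries are invented).
INGRESS_FEC = {"10.1.0.0/16": 101, "10.2.0.0/16": 102}   # header prefix -> tunnel label
LABEL_TABLE = {                                           # (router, in_label) -> (out_iface, out_label)
    ("R2", 101): ("eth1", 201),
    ("R3", 201): ("eth0", None),                          # None => pop the label at the egress
}

def ingress(dst_prefix, payload):
    label = INGRESS_FEC[dst_prefix]          # classify once, at the edge
    return {"label": label, "payload": payload}

def transit(router, packet):
    out_iface, out_label = LABEL_TABLE[(router, packet["label"])]
    if out_label is None:                    # egress: decapsulate, then route on the IP header again
        return out_iface, packet["payload"]
    packet["label"] = out_label              # label swap, no IP lookup
    return out_iface, packet
```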
Problem 1: Inefficiency
• The first problem is inefficiency in the usage of the expensive bandwidth.
Typically, these networks are provisioned for the peak traffic.
• As the figure shows, plotting utilization (y-axis) against time, the network
is provisioned for the peak traffic, yet the mean usage might be much smaller;
in this example it is 2.17 times smaller than the peak.
ACM SIGCOMM, 2013
Figure: Utilization vs. time.
Problem 1: Inefficiency
Most of this traffic is actually background traffic, with some latency-sensitive
traffic as well. So you can provision for the peak of the latency-sensitive
traffic, and then fill the gaps with the background traffic, which is not
latency sensitive.
ACM SIGCOMM, 2013
Problem 1: Inefficiency
So unless you differentiate traffic by service, you cannot do such an
optimization. This is not easy to do with the MPLS approach because it does not
have a global view of which services are running in the network and which parts
of the network they are using.
Also, a related point: regardless of whether there are multiple services or not,
with MPLS the routers make local, greedy choices about scheduling flows, so
traffic engineering is suboptimal.
For these reasons, such networks typically run at around 30% utilization to have
enough headroom for these inefficiencies, and this is expensive.
Problem 2: Inflexible sharing
Another big problem with the MPLS approach is that it only provides link-level
fairness: at any single link, the flows sharing it get fair shares of its
capacity. But this does not imply network-wide fairness.
For example, the green flow shares capacity on one link with the red flow, and
the blue flow shares capacity on another link with the red flow. The blue and
green flows each end up with half the capacity the red flow gets, because the
red flow uses multiple paths. So we have link-level fairness, but not
network-wide fairness.
Network-wide fairness is hard to achieve unless you have a global view of the
network, as the sketch below illustrates.
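To make "network-wide fairness needs a global view" concrete, here is a minimal progressive-filling sketch of max-min fair allocation. The topology and flow names are illustrative; this is not the allocation algorithm of any particular system.

```python
def max_min_fair(link_capacity, flow_links):
    """Progressive filling: raise all unfrozen flows' rates together until some
    link saturates, freeze the flows on that link, and repeat.

    link_capacity: dict {link: capacity}
    flow_links:    dict {flow: set of links it crosses}
    """
    rate = {f: 0.0 for f in flow_links}
    frozen = set()
    remaining = dict(link_capacity)
    while len(frozen) < len(flow_links):
        # How much every active flow can still grow, limited by its tightest link.
        headroom = {}
        for link, cap in remaining.items():
            active = [f for f in flow_links if link in flow_links[f] and f not in frozen]
            if active:
                headroom[link] = cap / len(active)
        if not headroom:
            break
        bottleneck = min(headroom, key=headroom.get)
        delta = headroom[bottleneck]
        for f in flow_links:
            if f not in frozen:
                rate[f] += delta
                for link in flow_links[f]:
                    remaining[link] -= delta
        frozen |= {f for f in flow_links
                   if bottleneck in flow_links[f] and f not in frozen}
    return rate

# Illustrative: one flow crosses both unit-capacity links, the others one each.
print(max_min_fair({"L1": 1.0, "L2": 1.0},
                   {"red": {"L1", "L2"}, "green": {"L1"}, "blue": {"L2"}}))
# -> each flow gets 0.5
```

The point is that the allocator needs the whole {flow: links} map, i.e. a global view; per-link schedulers acting independently cannot compute such an allocation.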
Cutting-edge WAN Traffic Engineering (TE)
Google’s B4
ACM SIGCOMM, 2013
Microsoft's Swan
ACM SIGCOMM, 2013
1. Leverage service diversity: some services tolerate delay
• Some services need a certain amount of bandwidth at a certain moment in time
and are inflexible; other services can be used to fill in whatever room is left
over.
• For example, latency-sensitive query traffic versus delay-tolerant background
transfers.
2. Centralized TE using SDN, OpenFlow
• A software-defined networking approach: gather information about the state of
the network.
• Make a centralized decision about the flow of traffic and then push those
decisions down to the lower layers to actually implement them.
• But bringing all that information together in one place, and deciding, is
relatively complex.
3. Exact linear programming is too slow
• The traditional optimization technique here is linear programming: a way to
take a set of constraints on required amounts of data flow over parts of a
network and come up with an optimal solution. But it is too slow to apply when
we need to make decisions relatively quickly.
• Part of the complexity comes from the multitude of services with different
priorities. If we had just one service, we could run a max-flow algorithm, and
that would be much faster.
• So we need something faster, even if it is not guaranteed to be exactly
optimal (see the toy LP sketch below).
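For intuition only, here is a tiny linear program of the kind a centralized TE system solves: two services share one 10 Gbps link, and the latency-sensitive one is weighted higher so its demand is served first. Real systems solve much larger multi-commodity versions, which is why faster approximations are needed. This assumes SciPy is available; all numbers are made up.

```python
from scipy.optimize import linprog

# Decision variables: x = [rate_interactive, rate_background] in Gbps.
# Maximize 10*x0 + 1*x1 (priority weight on the latency-sensitive service);
# linprog minimizes, so we negate the objective.
c = [-10.0, -1.0]
A_ub = [[1.0, 1.0]]          # the two services share one link
b_ub = [10.0]                # link capacity: 10 Gbps
bounds = [(0, 4.0),          # interactive demand is at most 4 Gbps
          (0, None)]         # background takes whatever is left

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x)                 # ≈ [4, 6]: serve the priority demand, fill the rest
```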
4. Dynamic reallocation of bandwidth
• The demands on the network change over time, so we must continually decide
which traffic has the highest priority to move across which links at a given
moment; making such decisions quickly is a challenge for linear programming.
• So these are online algorithms, but they are not online in the same way
things inside a data center might be.
• For example, Google runs its traffic engineering 500 or so times a day. So it
is not as fine-grained as things we might need inside a data center; traffic
between these facilities appears to be relatively stable.
5. Edge rate limiting
• A commonality in these architectures is to enforce the computed flow rates
when traffic enters the network, at the edge, rather than at every hop along
the path (a token-bucket sketch follows).
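Edge rate limiting is commonly implemented with something like a token bucket per flow group at the sending hosts; a minimal sketch (not the actual B4/SWAN mechanism):

```python
import time

class TokenBucket:
    """Allow `rate` bytes/sec with bursts up to `burst` bytes; refuse the rest."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True            # within the allocated rate: forward the packet
        return False               # exceeds the allocation: drop, mark, or queue

# One limiter per (service, destination site) at the edge, so interior WAN
# routers never see more traffic than the TE allocation permits.
limiter = TokenBucket(rate=1_000_000, burst=64_000)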
(ii) B4: Google’s WAN (2011)
• Google’s B4 was the first highly visible SDN success. It is a private
WAN connecting Google’s data centers across the planet. It has 12
locations spread across three different continents.
• It has a number of unique characteristics: (i) massive bandwidth requirements
deployed to a modest number of sites, (ii) elastic traffic demand that seeks to
maximize average bandwidth, and (iii) full control over the edge servers and
network, which enables rate limiting and demand measurement at the edge.
Google’s B4 worldwide deployment (2011)
What happens at one site, one data center
Google’s B4: view at one site
• Inside one site, there is the data center network of cluster and border
routers.
• In the traditional setting, these are connected using eBGP to the WAN routers,
which would then also interface with the other data center sites using iBGP or
IS-IS.
Figure: WAN routers at one B4 site.
Google’s B4: view at one site
• For finer control over routing, the routing function is moved to a software
router: the Quagga routing stack running on a server.
• OpenFlow is used to set up forwarding rules on the WAN routers: Quagga runs
the routing protocols with the cluster border routers and with the other sites,
and OpenFlow takes the routing information from Quagga and installs forwarding
rules in the WAN routers.
• Now there is software control, and a traffic engineering (TE) server manages
exactly which rules are installed.
Figure: Mixed SDN deployment — WAN routers.
Google’s B4: Traffic Engineering
• The TE server collects the topology information and the available-bandwidth
information, and has the latest information about flow demands between
different sites.
• The TE server pushes the flow allocations out to the different data centers.
At each data center, the controllers then enforce those flow allocations
locally.
Figure: TE server.
Google’s B4: Design Choices
• BGP routing as “big red switch”:
• They keep the BGP forwarding state available: each of their switches supports
multiple forwarding tables at the same time, so they can afford to keep BGP
forwarding tables alongside the TE entries.
• If the traffic engineering scheme does not work, they can discard the TE
forwarding tables and fall back to the BGP forwarding state instead. So BGP
routing serves as a "big red switch".
Figure: A custom-built switch and its topology
Google’s B4: Design Choices
• Hardware: custom, cheap, simple:
• No switches on the market met B4's requirements at reasonable cost, so Google
built their own using cheap commodity equipment.
Figure: A custom-built switch and its topology
Google’s B4: Design Choices
• Software smart with safeguard:
• The switches are kept simple; the smartness is in software. The OpenFlow
control logic at each site is replicated for fault tolerance using Paxos.
• Further, for scalability, the system uses a hierarchy of controllers. The
software solution achieves nearly 100% utilization and solves the traffic
engineering problem in 0.3 seconds.
Google’s B4: Design Choices
• Hierarchy of controllers:
• At the top level there is the global controller, which talks to an SDN
gateway. The gateway can be thought of as a super-controller that talks to the
controllers at all the different data center sites.
• Each site might itself have multiple controllers, because of the scale of
these networks. This hierarchy of controllers simplifies things from the global
perspective.
Figure: B4 architecture overview
Google’s B4: Design Choices
• Aggregation: flow groups, link groups:
• The TE server reasons not about individual flows but about flow groups; that
also helps scaling. Further, each pair of sites is not connected by just one
link.
• These are massive-capacity links formed from a trunk of several parallel
high-capacity links.
• All of these are aggregated and exposed to the traffic engineering layer as
one logical link. It is up to the individual site to partition, multiplex and
demultiplex traffic across the set of physical links.
(iii) Microsoft’s Swan
• Microsoft has publicly disclosed its design for optimizing wide-area traffic
flow in its WAN. [ACM SIGCOMM, 2013]
• An interesting feature of this design is the way it makes changes to the
traffic flow without causing congestion.
Figure: Architecture of Microsoft's Swan.
Microsoft’s Swan
• One can think of different ways to apply an update, but some link may end up
being used by both the old and the new flows at the same time. SWAN's design
makes those updates while keeping a certain amount of spare capacity on every
link, so that congestion can be avoided.
• This takes the optimization one step further, providing a fairly strong
guarantee on the absence of congestion even while the network's flow
allocations are changing (a simple invariant check is sketched below).
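The congestion-free-update idea can be sanity-checked with a simple invariant: while moving between two allocations, a flow may momentarily send at either its old or its new rate, so a step is safe if every link's worst-case load — summing each flow at the larger of its two rates — stays within capacity. A hedged sketch of that check (not SWAN's actual algorithm, which also computes the intermediate steps using the scratch capacity):

```python
def transition_is_congestion_free(configs, flow_links, capacity):
    """configs: list of successive allocations [{flow: rate}, ...], old to new.
    During the change from configs[i] to configs[i+1], each flow may be at
    either rate, so bound every link by the sum of per-flow maxima."""
    for old, new in zip(configs, configs[1:]):
        load = {link: 0.0 for link in capacity}
        for flow, links in flow_links.items():
            worst = max(old.get(flow, 0.0), new.get(flow, 0.0))
            for link in links:
                load[link] += worst
        if any(load[l] > capacity[l] + 1e-9 for l in capacity):
            return False
    return True
```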
Conclusion
• In this lecture, we have discussed geo-distributed cloud data centers and the
interaction of data centers with users and with other data centers.
• We also discussed various data center interconnection techniques:
(i) MPLS, (ii) Google's B4 and (iii) Microsoft's Swan.
Thank You!
Week 6 Lecture 2
Time and Clock Synchronization
Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Preface
Content of this Lecture:
• In this lecture, we will discuss the fundamentals of clock synchronization in
the cloud and its different algorithms.
• We will also discuss causality and a general framework of logical clocks, and
present two systems of logical time, namely Lamport and vector timestamps, to
capture causality between events of a distributed computation.
Need of Synchronization
• You want to catch a bus at 9.05 am, but your watch is off by 15 minutes.
• What if your watch is late by 15 minutes?
• You'll miss the bus!
• What if your watch is fast by 15 minutes?
• You'll end up unfairly waiting for longer than you intended.
• Time synchronization is required for:
• Correctness
• Fairness
Time and Synchronization
• Time and Synchronization ("There's never enough time…")
• Distributed Time
• The notion of time is well defined (and measurable) at each single location.
• But the relationship between time at different locations is unclear.
• Time synchronization is required for:
• Correctness
• Fairness
Synchronization in the cloud
Example: Cloud-based airline reservation system:
• Server X timestamps the purchase using its local clock as 6h:25m:42.55s, logs
it, and replies OK to the client.
• That was the very last seat, so Server X sends a message to Server Y saying
"flight is full".
• Y logs "Flight PQR 123 is full" together with its own local clock value, which
happens to read 6h:20m:20.21s.
• Server Z queries X's and Y's logs, and is confused that a client purchased a
ticket at X after the flight became full at Y.
• This may lead to incorrect actions at Z.
Key Challenges
• End-hosts in Internet-based systems (like clouds):
• Each has its own clock.
• Unlike processors (CPUs) within one server or workstation, which share a
system clock.
• Processes in Internet-based systems follow an asynchronous model:
• No bounds on
• message delays
• processing delays
• Unlike multi-processor (or parallel) systems, which follow a synchronous
system model.
Definitions
• An asynchronous distributed system consists of a number of processes.
• Each process has a state (values of variables).
• Each process takes actions to change its state; an action may be an
instruction or a communication action (send, receive).
• An event is the occurrence of an action.
• Each process has a local clock, so events within a process can be assigned
timestamps and thus ordered linearly.
• But in a distributed system, we also need to know the time order of events
across different processes.
Space-time diagram
Figure: The space-time diagram of a distributed execution, showing processes as
horizontal lines and, on them, internal events, message send events, and
message receive events.
Clock Skew vs. Clock Drift
• Each process (running at some end host) has its own clock.
• When comparing two clocks at two processes:
• Clock skew = relative difference in the clock values of two processes.
• Like the distance between two vehicles on a road.
• Clock drift = relative difference in the clock frequencies (rates) of two
processes.
• Like the difference in speeds of two vehicles on the road.
• A non-zero clock skew implies the clocks are not synchronized.
• A non-zero clock drift causes the skew to increase (eventually).
• If the faster vehicle is ahead, it will drift away.
• If the faster vehicle is behind, it will catch up and then drift away.
Clock Inaccuracies
• Clocks that must not only be synchronized with each other but also have to
adhere to physical time are termed physical clocks.
• Physical clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time).
• A timer (clock) C is said to be working within its specification if (where
the constant ρ is the maximum skew rate specified by the manufacturer):
1 − ρ ≤ dC/dt ≤ 1 + ρ
Figure: The behavior of fast, slow, and perfect clocks with respect to UTC.
How often to Synchronize
• Maximum drift rate (MDR) of a clock:
• Absolute MDR is defined relative to Coordinated Universal Time (UTC). UTC is
the correct time at any point in time.
• The MDR of a process depends on its environment.
• The maximum drift rate between two clocks with similar MDR is 2 × MDR.
• Given a maximum acceptable skew M between any pair of clocks, we need to
synchronize at least once every M / (2 × MDR) time units.
• (Since time = distance / speed: the skew plays the role of distance, the
relative drift rate the role of speed; see the worked example below.)
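Plugging illustrative numbers into the bound M / (2 × MDR):

```python
# If two clocks may drift apart at up to 2 * MDR seconds per second, and we can
# tolerate a skew of at most M seconds, we must resynchronize every M / (2*MDR).
MDR = 20e-6          # 20 ppm, a typical quartz-oscillator figure (illustrative)
M = 1e-3             # tolerate at most 1 ms of skew

resync_interval = M / (2 * MDR)
print(resync_interval)   # ≈ 25 seconds
```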
External vs Internal Synchronization
• Consider a group of processes.
• External synchronization:
• Each process C(i)'s clock is within a bound D of a well-known clock S
external to the group:
• |C(i) − S| < D at all times.
• The external clock may be connected to UTC (Universal Coordinated Time) or an
atomic clock.
• Examples: Cristian's algorithm, NTP.
• Internal synchronization:
• Every pair of processes in the group has clocks within bound D:
• |C(i) − C(j)| < D at all times and for all processes i, j.
• Examples: Berkeley algorithm, DTP.
External vs Internal Synchronization
• Internal synchronization does not imply external synchronization.
• In fact, the entire system may drift away from the external clock S!
Basic Fundamentals
External time synchronization: all processes P synchronize with a time server S.
Figure: P asks S "What's the time?"; S checks its local clock to find time t and
replies "Here's the time t"; P sets its clock to t.
What's wrong:
• By the time the message is received at P, time has moved on.
• P's time, set to t, is inaccurate.
• The inaccuracy is a function of message latencies.
• Since latencies are unbounded in an asynchronous system, the inaccuracy cannot
be bounded.
(i) Cristian's Algorithm
• P measures the round-trip time (RTT) of the message exchange.
• Suppose we know the minimum P → S latency min1 and the minimum S → P latency
min2.
• min1 and min2 depend on the OS overhead to buffer messages, TCP time to queue
messages, etc.
• The actual time at P when it receives the response is in the interval
[t + min2, t + RTT − min1].
Figure: the same exchange as before ("What's the time?" / "Here's the time t"),
annotated with the RTT.
(i) Cristian's Algorithm
• The actual time at P when it receives the response is in
[t + min2, t + RTT − min1].
• P sets its time to halfway through this interval, i.e. to
t + (RTT + min2 − min1)/2.
• The error is at most (RTT − min2 − min1)/2, so the error is bounded by the RTT
(see the sketch below).
Figure: the same exchange, with P setting its clock to the midpoint of the
interval.
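A minimal sketch of the Cristian adjustment, assuming the client has measured the RTT and the server returned time t; min1 and min2 are the assumed minimum one-way latencies:

```python
def cristian_adjust(t_server, rtt, min1=0.0, min2=0.0):
    """Return (estimated time now, maximum error) per Cristian's algorithm.

    The true time at P lies in [t_server + min2, t_server + rtt - min1];
    taking the midpoint bounds the error by half the interval's width.
    """
    estimate = t_server + (rtt + min2 - min1) / 2
    max_error = (rtt - min1 - min2) / 2
    return estimate, max_error

# Illustrative numbers: RTT = 0.8 s, minimum one-way latency 0.2 s each way.
est, err = cristian_adjust(t_server=0.0, rtt=0.8, min1=0.2, min2=0.2)
print(est, err)   # est ≈ 0.4 s past t, error bound ≈ 0.2 s
```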
Error Bounds
• A process is allowed to increase its clock value but should never decrease it:
• decreasing it may violate the ordering of events within the same process.
• It is allowed to increase or decrease the speed of the clock.
• If the error is too high, take multiple readings and average them.
Cristian's Algorithm: Example
• Send request at 5:08:15.100 (T0).
• Receive response at 5:08:15.900 (T1); the response contains 5:09:25.300
(Tserver).
• Elapsed time is T1 − T0 = 5:08:15.900 − 5:08:15.100 = 800 msec.
• Best guess: the timestamp was generated 400 msec ago.
• Set time to Tserver + 400 msec = 5:09:25.300 + 400 msec = 5:09:25.700.
• If the best-case (minimum) one-way message time is 200 msec, the error is at
most (800 − 200 − 200)/2 = 200 msec.
T0 = 5:08:15.100, T1 = 5:08:15.900, Tserver = 5:09:25.300, Tmin = 200 msec
(ii) NTP: Network time protocol
• (1991, 1992) Internet standard, version 3: RFC 1305.
• NTP servers are organized in a tree; each client is a leaf of the tree.
• Each node synchronizes with its tree parent.
Figure: Primary servers at the root, then secondary servers, tertiary servers,
and clients at the leaves.
NTP Protocol
Figure: The child starts the protocol with Message 1; the parent replies with
Message 2. ts1 and tr1 are the send and receive times of Message 1; ts2 and tr2
are the send and receive times of Message 2.
• The child computes its offset as o = (tr1 − tr2 + ts2 − ts1)/2.
• Suppose the child is ahead of the parent by oreal (the parent is ahead of the
child by −oreal).
• Suppose the one-way latency of Message 1 is L1 (and of Message 2 is L2).
No one knows L1 or L2!
• Then
• tr1 = ts1 + L1 + oreal
• tr2 = ts2 + L2 − oreal
Why o = (tr1-tr2 + ts2- ts1)/2 ?
• Then
• tr1 = ts1 + L1 + oreal
• tr2 = ts2 + L2 − oreal
• Solving the two equations: oreal = (tr1 − tr2 + ts2 − ts1)/2 + (L2 − L1)/2
• ⇒ oreal = o + (L2 − L1)/2
• ⇒ |oreal − o| = |(L2 − L1)/2| ≤ (L2 + L1)/2
• Thus the error is bounded by the round-trip time (RTT); a sketch follows.
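A sketch of the offset computation using the four timestamps, with variable names following the slide (ts1/tr1 for Message 1, ts2/tr2 for Message 2); this is an illustration, not the full NTP implementation:

```python
def ntp_offset(ts1, tr1, ts2, tr2):
    """Estimated offset o of the child relative to the parent, and its error bound.

    Message 1: sent at ts1, received at tr1 (each on the sender's/receiver's clock).
    Message 2: sent at ts2, received at tr2.
    """
    o = (tr1 - tr2 + ts2 - ts1) / 2          # estimate of the real offset
    rtt = (tr2 - ts1) - (ts2 - tr1)          # round trip minus the turnaround time
    return o, rtt / 2                        # |oreal - o| <= RTT / 2
```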
(iii) Berkeley Algorithm
• Gusella & Zatti, 1989.
• The master polls each machine periodically:
• ask each machine for its time;
• Cristian's algorithm can be used to compensate for network latency.
• When the results are in, compute the average, including the master's time.
• Hope: the average cancels out each individual clock's tendency to run fast or
slow.
• Send each slave the offset by which its clock needs adjustment (see the
sketch below).
• Sending offsets rather than timestamps avoids problems with network delays.
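A hedged sketch of one Berkeley round, assuming the master already holds latency-compensated clock readings from each slave; the 3:00 / 3:25 / 2:50 numbers are a classic textbook illustration, not taken from the lecture's figure.

```python
def berkeley_round(master_time, slave_times):
    """Return the offset each participant (master included) should apply.

    slave_times: dict {slave_id: reported_time}, already latency-compensated.
    """
    readings = {"master": master_time, **slave_times}
    avg = sum(readings.values()) / len(readings)
    # Send back offsets (avg - reading) rather than absolute times, so the
    # correction is unaffected by how long the reply takes to arrive.
    return {node: avg - t for node, t in readings.items()}

# Illustrative: master at 3:00, slaves at 3:25 and 2:50 -> average is 3:05.
print(berkeley_round(180.0, {"s1": 205.0, "s2": 170.0}))
# -> master +5, s1 -20, s2 +15 (minutes, in this toy example)
```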
Berkeley Algorithm: Example
Figure: one round of the Berkeley averaging protocol.
(iv) DTP: Datacenter Time Protocol
• DTP uses the physical layer of network devices to implement a decentralized
clock synchronization protocol.
• Highly scalable, with bounded precision!
– ~25 ns (4 clock ticks) between peers
– ~150 ns for a datacenter with six hops
– no network traffic
– internal clock synchronization
• End-to-end: ~200 ns precision!
DTP: Phases
DTP: (i) Init Phase
• INIT phase: The purpose of the INIT phase is to measure the one-way
delay between two peers. The phase begins when two ports are
physically connected and start communicating, i.e. when the link
between them is established.
• Each peer measures the one-way delay by measuring the time between sending an
INIT message and receiving the associated INIT-ACK message, i.e. it measures
the RTT and divides it by two.
DTP: (ii) Beacon Phase
• BEACON phase: During the BEACON phase, two ports periodically
exchange their local counters for resynchronization. Due to oscillator
skew, the offset between two local counters will increase over time. A
port adjusts its local counter by selecting the maximum of the local and
remote counters upon receiving a BEACON message from its peer. Since
BEACON messages are exchanged frequently, hundreds of thousands of times a
second (every few microseconds), the offset can be kept to a minimum (a
simplified sketch follows).
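The BEACON-phase adjustment amounts to taking the maximum of the local and remote counters; a simplified software sketch (real DTP does this in the NIC/switch hardware every few microseconds):

```python
def on_beacon(local_counter, remote_counter):
    """DTP resynchronization rule: never move a counter backwards, just
    jump forward to the larger of the two values."""
    return max(local_counter, remote_counter)

def on_init_ack(tick_sent, tick_received):
    """INIT phase: estimate the one-way delay as half the measured RTT (in ticks)."""
    return (tick_received - tick_sent) // 2
```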
DTP Switch
DTP Property
• DTP provides bounded precision and scalability.
• Bounded precision in hardware:
– bounded by 4T (= 25.6 ns, where T = one oscillator tick of 6.4 ns)
– network precision bounded by 4TD, where D is the network diameter in hops
• Requires NIC and switch modifications.
But Yet…
• We still have a non-zero error!
• We just can't seem to get rid of the error.
• We can't, as long as message latencies are non-zero.
• Can we avoid synchronizing clocks altogether, and still be able to order
events?
Ordering events in a distributed system
• To order events across processes, synchronizing clocks is one approach.
• What if we instead assigned timestamps to events that were not absolute time?
• As long as those timestamps obey causality, that would work:
• if an event A causally happens before another event B, then
timestamp(A) < timestamp(B).
• Example: humans use causality all the time:
• I enter the house only if I unlock it.
• You receive a letter only after I send it.
Logical (or Lamport) ordering
• Proposed by Leslie Lamport in the 1970s.
• Used in almost all distributed systems since then.
• Almost all cloud computing systems use some form of logical ordering of
events.
• Leslie B. Lamport (born February 7, 1941) is an American computer scientist.
Lamport is best known for his seminal work in distributed systems and as the
initial developer of the document preparation system LaTeX. Leslie Lamport was
the winner of the 2013 Turing Award for imposing clear, well-defined coherence
on the seemingly chaotic behavior of distributed computing systems, in which
several autonomous computers communicate with each other by passing messages.
Lamport’s research contributions
• Lamport's research contributions have laid the foundations of the theory of
distributed systems. Among his most notable papers are:
• "Time, Clocks, and the Ordering of Events in a Distributed System", which
received the PODC Influential Paper Award in 2000,
• "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess
Programs", which defined the notion of sequential consistency,
• "The Byzantine Generals' Problem",
• "Distributed Snapshots: Determining Global States of a Distributed System", and
• "The Part-Time Parliament".
• These papers relate to such concepts as logical clocks (and the
happened-before relationship) and Byzantine failures. They are among the most
cited papers in the field of computer science and describe algorithms to solve
many fundamental problems in distributed systems, including:
• the Paxos algorithm for consensus,
• the bakery algorithm for mutual exclusion of multiple threads in a computer
system that require the same resources at the same time,
• the Chandy-Lamport algorithm for the determination of consistent global
states (snapshots), and
• the Lamport signature, one of the prototypes of the digital signature.
Logical (or Lamport) Ordering(2)
• Define a logical relation happens-before among pairs of events.
• Happens-before is denoted as →.
• Three rules:
1. If a and b occur at the same process and a occurs before b (by the local
clock), then a → b.
2. If p1 sends m to p2: send(m) → receive(m).
3. (Transitivity) If a → b and b → c then a → c.
• This creates a partial order among events:
• not all events are related to each other via →.
Example 1:
Figure: Space-time diagram with events A, B, C, D, E on P1, events E, F, G on
P2, and events H, I, J on P3 (arrows denote messages, ticks denote
instructions/steps).
Example 1: Happens-Before
Figure: the same space-time diagram (P1: A–E, P2: E–G, P3: H–J).
• A → B
• B → F
• A → F
Example 2: Happens-Before
Figure: the same space-time diagram.
• H → G
• F → J
• H → J
• C → J
Lamport timestamps
• Goal: assign a logical (Lamport) timestamp to each event.
• Timestamps obey causality.
• Rules:
• Each process uses a local counter (clock), which is an integer;
• the initial value of the counter is zero.
• A process increments its counter when a send or an instruction happens at it.
The counter value is assigned to the event as its timestamp.
• A send (message) event carries its timestamp.
• For a receive (message) event, the counter is updated by
max(local counter, message timestamp) + 1 (see the sketch below).
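A minimal Lamport clock implementation following the rules above (a sketch, not any particular system's code):

```python
class LamportClock:
    def __init__(self):
        self.counter = 0                  # initial value is zero

    def local_event(self):
        self.counter += 1                 # instruction or send: increment first
        return self.counter

    def send(self):
        return self.local_event()         # the message carries this timestamp

    def receive(self, msg_timestamp):
        # Receive rule: jump past both the local clock and the message's timestamp.
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

# e.g. a send carrying timestamp 1 makes the receiver's clock max(0, 1) + 1 = 2.
p1, p2 = LamportClock(), LamportClock()
print(p2.receive(p1.send()))              # -> 2
```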
Lamport Timestamps: worked example
Figure (step by step, on the same space-time diagram):
• Initial counters (clocks) are 0 at P1, P2 and P3.
• P1's first event (A) gets timestamp 1. P3's first event (H) gets timestamp 1,
and its message to P2 carries ts = 1.
• On receiving that message, P2 sets its counter to
max(local, msg) + 1 = max(0, 1) + 1 = 2 (event E).
• P1's event B gets timestamp 2 and its message to P2 carries ts = 2; P2 updates
to max(2, 2) + 1 = 3 (event F).
• P1's event D receives a message carrying ts = 4 from P2 and updates to
max(3, 4) + 1 = 5.
• Final timestamps: P1 (A–E): 1, 2, 3, 5, 6; P2 (E–G): 2, 3, 4;
P3 (H–J): 1, 2, 7.
Obeying Causality
Figure: the same diagram with Lamport timestamps (P1: 1, 2, 3, 5, 6;
P2: 2, 3, 4; P3: 1, 2, 7).
• A → B :: 1 < 2
• B → F :: 2 < 3
• A → F :: 1 < 3
Obeying Causality (2)
Figure: the same diagram with Lamport timestamps.
• H → G :: 1 < 4
• F → J :: 3 < 7
• H → J :: 1 < 7
• C → J :: 3 < 7
Not always implying Causality
Figure: the same diagram with Lamport timestamps.
• C → F ? :: 3 = 3
• H → C ? :: 1 < 3
• (C, F) and (H, C) are pairs of concurrent events.
Concurrent Events
• A pair of concurrent events doesn't have a causal path from one event to the
other (either way, in the pair).
• Lamport timestamps are not guaranteed to be ordered or unequal for concurrent
events.
• That is OK, since concurrent events are not causally related!
• Remember: timestamp(E1) < timestamp(E2) ⇒ {E1 → E2} OR {E1 and E2 concurrent}.
Vector Timestamps
• Suppose there are N processes in the group 1…N.
• Each vector has N elements.
• Process i maintains vector Vi[1…N].
• The jth element of the vector clock at process i, Vi[j], is i's knowledge of
the latest events at process j.
Assigning Vector Timestamps
1. On an instruction or send event at process i, the process increments only
the ith element of its vector clock.
2. Each message carries the send-event's vector timestamp Vmessage[1…N].
3. On receiving a message at process i:
Vi[i] = Vi[i] + 1
Vi[j] = max(Vmessage[j], Vi[j]) for j ≠ i
(A sketch follows below.)
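A matching sketch for vector clocks, following rules 1–3 above (0-based process indices are an implementation choice):

```python
class VectorClock:
    def __init__(self, i, n):
        self.i = i                   # this process's index (0-based here)
        self.v = [0] * n

    def local_event(self):
        self.v[self.i] += 1          # rule 1: bump only our own entry
        return list(self.v)

    def send(self):
        return self.local_event()    # rule 2: the message carries this vector

    def receive(self, msg_v):
        self.v[self.i] += 1          # rule 3: bump own entry, ...
        for j, mv in enumerate(msg_v):
            if j != self.i:
                self.v[j] = max(self.v[j], mv)   # ... element-wise max elsewhere
        return list(self.v)

# Mirrors the worked example: P3 sends (0,0,1); P2 receives it and gets (0,1,1).
p2, p3 = VectorClock(1, 3), VectorClock(2, 3)
print(p2.receive(p3.send()))         # -> [0, 1, 1]
```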
Example: Vector Timestamps (worked on the same space-time diagram)
Figure (step by step):
• Initial vector clocks are (0,0,0) at P1, P2 and P3.
• P1's first event: (1,0,0). P3's first event: (0,0,1); its message to P2
carries (0,0,1).
• P2 receives it: increment its own entry, then take the element-wise max,
giving (0,1,1).
• P1's next event: (2,0,0); its message to P2 carries (2,0,0); P2 updates to
(2,2,1).
• Final vector timestamps: P1: (1,0,0), (2,0,0), (3,0,0), (4,3,1), (5,3,1);
P2: (0,1,1), (2,2,1), (2,3,1); P3: (0,0,1), (0,0,2), (5,3,3).
Causally-Related
• VT1 = VT2 iff (if and only if) VT1[i] = VT2[i] for all i = 1, …, N.
• VT1 ≤ VT2 iff VT1[i] ≤ VT2[i] for all i = 1, …, N.
• Two events are causally related iff VT1 < VT2, i.e.,
iff VT1 ≤ VT2 and there exists j such that 1 ≤ j ≤ N and VT1[j] < VT2[j].
… or Not Causally-Related
• Two events VT1 and VT2 are concurrent iff
NOT (VT1 ≤ VT2) AND NOT (VT2 ≤ VT1).
• We'll denote this as VT2 ||| VT1 (see the helper sketched below).
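The two definitions translate directly into code; a small helper for comparing vector timestamps:

```python
def leq(vt1, vt2):
    """VT1 <= VT2: every entry of VT1 is <= the corresponding entry of VT2."""
    return all(a <= b for a, b in zip(vt1, vt2))

def happened_before(vt1, vt2):
    """VT1 < VT2: VT1 <= VT2 and strictly smaller in at least one entry."""
    return leq(vt1, vt2) and any(a < b for a, b in zip(vt1, vt2))

def concurrent(vt1, vt2):
    """VT1 ||| VT2: neither dominates the other."""
    return not leq(vt1, vt2) and not leq(vt2, vt1)

# From the example: C = (3,0,0) and F = (2,2,1) are concurrent.
print(concurrent((3, 0, 0), (2, 2, 1)))   # -> True
```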
Obeying Causality
Figure: the same diagram with vector timestamps (P1: (1,0,0), (2,0,0), (3,0,0),
(4,3,1), (5,3,1); P2: (0,1,1), (2,2,1), (2,3,1); P3: (0,0,1), (0,0,2), (5,3,3)).
• C & F :: (3,0,0) ||| (2,2,1)
• H & C :: (0,0,1) ||| (3,0,0)
• (C, F) and (H, C) are pairs of concurrent events.
Summary : Logical Timestamps
• Lamport timestamps:
• integer clocks assigned to events;
• obey causality;
• cannot distinguish concurrent events.
• Vector timestamps:
• obey causality;
• by using more space, can also identify concurrent events.
Conclusion
• Clocks are unsynchronized in an asynchronous distributed system, but we need
to order events across processes!
• Time synchronization:
• Cristian's algorithm
• Berkeley algorithm
• NTP
• DTP
• But the error is a function of the RTT.