Week-6 Lecture Notes

The lecture discusses geo-distributed cloud data centers and their interconnection techniques, highlighting the importance of internet connectivity for data center performance. It covers traditional methods like MPLS and advanced approaches such as Google's B4 and Microsoft's Swan for traffic engineering. Key challenges include inefficiencies in bandwidth usage and inflexible sharing, which modern techniques aim to address through centralized decision-making and dynamic bandwidth allocation.


Week 6 Lecture 1

Geo-distributed Data Centers

Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Preface
Content of this Lecture:
• In this lecture, we will study geo-distributed cloud data centers and the interaction of data centers with users and with other data centers.
• We will also describe data center interconnection techniques: (i) traditional schemes such as MPLS, (ii) cutting-edge approaches such as Google's B4, and (iii) Microsoft's SWAN.
Inter-Data Center Networking: The Problem
• Today, virtually any use of a popular web application means communicating with a server in a data center.
• However, the connectivity for this service depends on the internet, so the internet becomes crucial to the application service's performance.
• Data centers also communicate with each other over the internet, for example to replicate client data across multiple data centers.
• In the cloud scenario, the wide-area connectivity (the internet) is as crucial as the data center infrastructure.

Figure: Clients and data centers communicating over the internet.
Why Multiple Data Centers?
• Why does a provider like Google need such an extensive infrastructure, with so many locations across a wide expanse of the globe?
• Better data availability: If one facility goes down, due to a natural disaster for example, the data can still be available at another location if it is replicated.
• Load balancing: Multiple facilities can spread incoming and outgoing traffic across a wider set of providers and wider geographic regions.
• Latency: A presence in multiple parts of the globe means clients in different locations can be reached over smaller distances, which reduces latency.
• Local data laws: Several jurisdictions may require that companies store data from that country within the jurisdiction itself.
• Hybrid public-private operation: A provider can handle the average demand for a service from its private infrastructure and offload peak demand to the public cloud.
Significant Inter-Data Center Traffic
• Study of five Yahoo! data centers from 2011.
• The study is based on anonymized traces from the border routers that connect to these data centers.

Figure: Overview of five major Yahoo! data centers and their network connectivity.
Significant Inter-Data Center Traffic
• Two plots show the number of flows: one between clients and the data centers, and one (on the right) between the data centers themselves. The three lines on each plot are for three different data center locations; note that the y-axes of the two plots differ.
• In terms of the number of flows, the traffic between data centers is 10% to 20% of the traffic from data centers to clients.
• Flows between data centers can be very long lived and carry more bytes than those between clients and data centers.
Why are these networks different?
• Persistent, dedicated, 100s-of-Gbps connectivity between a (small) set of end-points.
• This contrasts with the public internet WAN (end-host to application / end-host traffic): a private WAN between data centers offers much more design flexibility.
• Microsoft: an "expensive resource, with amortized annual cost of 100s of millions of dollars" [Achieving High Utilization with Software-Driven WAN, Hong et al., ACM SIGCOMM'13]
(i) MPLS: Traditional WAN approach
• The traditional approach to traffic engineering in such networks is to use MPLS (Multiprotocol Label Switching).
• Consider a network with several sites spread over a defined area, connected to each other, perhaps over long-distance fiber links.
1. Link-state protocol (OSPF / IS-IS)
• Use a link-state protocol (OSPF or IS-IS) to flood information about the network's topology to all nodes.
• At the end of such a protocol, every node has a map of the network.
2. Flood available bandwidth information
• For traffic engineering, also spread information about the bandwidth usage on the different links in the network.
• Given that there is already traffic flowing in this network, some links will have spare capacity and some won't.
• Both IS-IS and OSPF have extensions that allow available-bandwidth information to be flooded together with their protocol messages.
3. Fulfill tunnel provisioning requests
• Knowing the state of the network, when a router receives a new flow setup request, it sets up a tunnel along the shortest path on which enough capacity is available. It sends protocol messages to the routers on the path to set up this tunnel.
• MPLS also supports the notion of priorities: if a higher-priority flow arrives with a request for a path, lower-priority flows may be displaced. Those flows might then use higher-latency or higher-cost paths through the network.
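To make the path-computation step concrete, here is a minimal sketch of a constrained shortest-path (CSPF-style) computation: Dijkstra's algorithm restricted to links with enough spare capacity for the requested demand, reserving bandwidth along the chosen tunnel. This is an illustration only, not an actual MPLS-TE implementation, and the graph representation and names are assumptions.

```python
import heapq

def provision_tunnel(links, spare, src, dst, demand):
    """Sketch: shortest path by metric, using only links whose spare capacity
    can carry `demand`. `links` maps node -> list of (neighbor, metric, link_id);
    `spare` maps link_id -> spare bandwidth. Hypothetical structures."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break                                # shortest feasible path found
        if d > dist.get(u, float("inf")):
            continue                             # stale heap entry
        for v, metric, link_id in links.get(u, []):
            if spare[link_id] < demand:
                continue                         # not enough headroom on this link
            nd = d + metric
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, (u, link_id)
                heapq.heappush(heap, (nd, v))
    if dst != src and dst not in prev:
        return None                              # no feasible path: reject (or preempt lower priority)
    path, node = [], dst
    while node != src:
        u, link_id = prev[node]
        path.append(link_id)
        spare[link_id] -= demand                 # reserve bandwidth along the tunnel
        node = u
    return list(reversed(path))
```

Under a priority scheme as described above, a `None` result for a high-priority request would trigger a search for lower-priority tunnels to displace before retrying.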
4. Update network state, flood information
• After a flow is assigned a tunnel, the routers also update the network state and flood the new available-bandwidth information.
4. Update network state, flood information
• When a data packet arrives at the ingress router, the router looks at the packet's header and decides what label, that is, what tunnel, the packet belongs to. It then encapsulates the packet with that tunnel's label and sends it along the tunnel.
• The egress router decapsulates the packet, looks at the packet header again and sends it to the destination. In this scheme, only the ingress and egress routers read the packet header; every other router on the path just looks at the assigned label.
Simple forwarding along the path
• This makes forwarding along the path very simple, which is why the protocol is called label switching.
• MPLS can also run over several different protocols: as long as the ingress and egress routers understand a protocol and can map it onto labels, it can be carried. Hence the full name, Multiprotocol Label Switching.
Problem 1: Inefficiency
• The first problem is inefficiency in the use of the expensive bandwidth. Typically, these networks are provisioned for the peak traffic.
• Plotting utilization over time, the network is provisioned for the peak, but the mean usage may be much smaller; in this example it is 2.17 times smaller than the peak. (ACM SIGCOMM, 2013)

Figure: Utilization over time, showing peak provisioning vs. mean usage.
Problem 1: Inefficiency
• Most of this traffic is actually background traffic, with some latency-sensitive traffic as well.
• So one can provision for the peak of the latency-sensitive traffic, and then fill the gaps with the background traffic, which is not latency sensitive. (ACM SIGCOMM, 2013)
Problem 1: Inefficiency
• Unless you differentiate traffic by service, you cannot do such an optimization. This is not easy with the MPLS approach because it does not have a global view of what services are running in the network, what parts of the network they are using, and so on.
• A related point: regardless of whether there are multiple services, with MPLS the routers make local, greedy choices about scheduling flows, so traffic engineering is sub-optimal.
• For these reasons, such networks typically run at around 30% utilization to leave enough headroom for these inefficiencies, and this is expensive.
Problem 2: Inflexible sharing
• Another big problem with the MPLS approach is that it only provides link-level fairness: at any link, the flows share capacity fairly. But this does not imply network-wide fairness.
• For example, suppose the green flow shares one link with the red flow, and the blue flow shares another link with the red flow. The blue and green flows each end up with half the capacity of the red flow, because the red flow uses multiple paths. So we have link-level fairness, but not network-wide fairness.
• Network-wide fairness is hard to achieve unless you have a global view of the network.
Cutting-edge WAN Traffic Engineering (TE)
• Google's B4 (ACM SIGCOMM, 2013)
• Microsoft's SWAN (ACM SIGCOMM, 2013)
1. Leverage service diversity: some services tolerate delay
• The goal is very high bandwidth utilization in the WAN despite natural fluctuations in demand over time. To achieve this, leverage diversity among the services: some services need a certain amount of bandwidth at a certain moment and are inflexible, while other services can be used to fill in whatever room is left over.
• For example, latency-insensitive background transfers can fill the gaps around latency-sensitive queries.
2. Centralized TE using SDN, OpenFlow
• A software-defined networking approach gathers information about the state of the network.
• Make a centralized decision about the flow of traffic, then push those decisions down to the lower levels to actually implement them.
• But bringing all that information together in one place and deciding is relatively complex.
3. Exact linear programming is too slow
• Traditionally one would use an optimization technique like linear programming, which takes a set of constraints on the required amounts of data flow over parts of a network and computes an optimal solution. But it is hard to apply when we need to make relatively quick decisions.
• Part of the complexity comes from the multitude of services with different priorities. With just one service, a max-flow algorithm could be used instead, and that would be much faster.
• So something faster is required, even if it is not guaranteed to be exactly optimal.
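As a toy illustration of the "exact" formulation (not the actual B4 or SWAN solver), the allocation can be written as a linear program: maximize the total admitted rate over a set of candidate tunnels, subject to link capacities and per-demand ceilings. The tunnels, links and numbers below are made up for illustration, and the sketch assumes SciPy is available.

```python
from scipy.optimize import linprog

# Hypothetical tunnels, per-tunnel demands (Gbps) and link capacities (Gbps).
tunnels    = ["A-B via l1", "A-B via l2", "A-C via l2+l3"]
demand_cap = [10.0, 10.0, 6.0]
capacity   = {"l1": 8.0, "l2": 7.0, "l3": 5.0}
crosses    = {"l1": [0], "l2": [1, 2], "l3": [2]}   # which tunnels use which link

c = [-1.0] * len(tunnels)                 # maximize total rate = minimize -sum(x)
A_ub, b_ub = [], []
for link, cap in capacity.items():        # one capacity constraint per link
    A_ub.append([1.0 if t in crosses[link] else 0.0 for t in range(len(tunnels))])
    b_ub.append(cap)
bounds = [(0.0, d) for d in demand_cap]   # 0 <= x[t] <= demand of tunnel t

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(dict(zip(tunnels, res.x)), "total admitted =", -res.fun)
```

With thousands of flow groups, many priority classes and many candidate tunnels, solving such a program exactly at every update becomes the bottleneck, which is why faster, approximately optimal allocation schemes are used instead.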
4. Dynamic reallocation of bandwidth
• The demands on the network change over time, so continual decisions are needed about which traffic has the highest priority to move across which links at a given moment. Making such quick decisions is a challenge with linear programming.
• These are therefore online algorithms, but they are not online in the same way as mechanisms inside a data center might be.
• For example, Google runs its traffic engineering about 500 times a day, so it is not as fine-grained as things we might need inside a data center. Traffic between these facilities appears to be relatively stable.
5. Edge rate limiting
• A commonality in these architectures is to enforce the computed flow rates where traffic enters the network, i.e. at the edge, rather than at every hop along the path.
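A common way to enforce an assigned rate at the edge is a token bucket. The sketch below is illustrative only (it is not the B4/SWAN implementation, which enforces rates in hosts and hypervisors at much higher speed), and all names are hypothetical.

```python
import time

class TokenBucket:
    """Admit a packet only if the flow (group) has stayed within its assigned rate."""
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8.0           # refill rate in bytes per second
        self.capacity = burst_bytes          # maximum burst size in bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes      # admit and charge the packet
            return True
        return False                         # over the assigned rate: drop or queue

# Example: limit a flow group to 1 Gbps with a 64 KB burst allowance.
limiter = TokenBucket(rate_bps=1e9, burst_bytes=64 * 1024)
print(limiter.allow(1500))                   # True for the first full-size packet
```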
(ii) B4: Google's WAN (2011)
• Google's B4 was the first highly visible SDN success. It is a private WAN connecting Google's data centers across the planet, with 12 locations spread across three continents.
• It has a number of unique characteristics: (i) massive bandwidth requirements deployed to a modest number of sites, (ii) elastic traffic demand that seeks to maximize average bandwidth, and (iii) full control over the edge servers and network, which enables rate limiting and demand measurement at the edge.

Figure: Google's B4 worldwide deployment (2011).

What happens at one site, one data center?
Google's B4: view at one site
• Inside one site there is the data center network, made up of cluster and border routers.
• In the traditional setting, these connect to the WAN routers using eBGP, and the WAN routers in turn interface with the other data center sites using iBGP or IS-IS.

Figure: WAN routers at one B4 site.
Google's B4: view at one site
• For finer control over routing, the routing function is moved to a software router: the Quagga routing stack running on a server.
• Quagga runs the routing protocols with the cluster border routers and with the other sites; OpenFlow then uses the routing information from Quagga to set up forwarding rules in the WAN routers.
• With this software control in place, a traffic engineering (TE) server manages exactly which rules are installed.

Figure: Mixed SDN deployment with WAN routers at one site.
Google's B4: Traffic Engineering
• The TE server collects the topology information and the available-bandwidth information, and has the latest information about flow demands between different sites.
• The TE server pushes the flow allocations out to the different data centers; at each data center, the controllers then enforce those flow allocations.

Figure: The TE server in the B4 architecture.
Google's B4: Design Choices
• BGP routing as a "big red switch":
• The BGP forwarding state is also kept available: each switch supports multiple forwarding tables at the same time, so it can afford to hold the BGP forwarding tables alongside the TE state.
• If the traffic engineering scheme does not work, the TE routing tables can be discarded and the BGP forwarding state used instead. The BGP routing tables thus serve as a "big red switch".
Google's B4: Design Choices
• Hardware: custom, cheap, simple:
• In 2011, there were not many software-defined networking switches on the market, so Google built their own using cheap commodity equipment.

Figure: A custom-built switch and its topology.
Google's B4: Design Choices
• Software smart, with safeguards:
• Most of the smarts reside in the software; the switches are kept simple.
• The software, i.e. the OpenFlow control logic at each site, is replicated for fault tolerance using Paxos.
• Further, for scalability the system uses a hierarchy of controllers. The software solution achieves nearly 100% utilization and solves the traffic engineering problem in 0.3 seconds.
Google's B4: Design Choices
• Hierarchy of controllers:
• At the top level there is the global controller, which talks to an SDN gateway. The gateway can be thought of as a super-controller that talks to the controllers at all the different data center sites.
• Each site may itself have multiple controllers, because of the scale of these networks. This hierarchy of controllers simplifies things from the global perspective.

Figure: B4 architecture overview.
Google's B4: Design Choices
• Aggregation: flow groups, link groups:
• Traffic engineering at this global scale is done not at the level of individual flows but of flow groups, which also helps scaling.
• Further, each pair of sites is not connected by just one link: the massive-capacity links are formed from a trunk of several parallel high-capacity links.
• All of these are aggregated and exposed to the traffic engineering layer as one logical link. It is up to the individual site to partition, multiplex and demultiplex traffic across the set of physical links.
(iii) Microsoft's SWAN
• Microsoft has publicly disclosed its design for optimizing wide-area traffic flow in its WAN (ACM SIGCOMM, 2013).
• An interesting feature of this design is the way it makes changes to the traffic flow without causing congestion.

Figure: Architecture of Microsoft's SWAN.
Microsoft's SWAN
• One can think of different ways to update the traffic allocation, but some link may end up being used by both flows (old and new) at the same time. SWAN's approach is to make those updates while keeping a certain amount of spare capacity on each link, so that congestion can be avoided.
• This takes the optimization one step further, providing a fairly strong guarantee on the absence of congestion even while the network's data flows are changing.
Conclusion
• In this lecture, we have discussed geo-distributed cloud data centers and the interaction of data centers with users and with other data centers.
• We have also discussed various data center interconnection techniques: (i) MPLS, (ii) Google's B4 and (iii) Microsoft's SWAN.
Thank You!

Week 6 Lecture 2

Time and Clock Synchronization

Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Preface
Content of this Lecture:
• In this lecture, we will discuss the fundamentals of clock synchronization in the cloud and its different algorithms.
• We will also discuss causality and a general framework of logical clocks, and present two systems of logical time, namely Lamport timestamps and vector timestamps, to capture causality between events of a distributed computation.
Need of Synchronization
• You want to catch a bus at 9.05 am, but your watch is off by 15 minutes.
• What if your watch is late by 15 minutes? You'll miss the bus!
• What if your watch is fast by 15 minutes? You'll end up unfairly waiting for longer than you intended.
• Time synchronization is required for:
• Correctness
• Fairness
Time and Synchronization
• Time and synchronization ("There is never enough time…")
• Distributed time:
• The notion of time is well defined (and measurable) at each single location.
• But the relationship between time at different locations is unclear.
• Time synchronization is required for:
• Correctness
• Fairness
Synchronization in the Cloud
Example: cloud-based airline reservation system.
• Server X receives a client request to purchase the last ticket on a flight, say PQR 123.
• Server X timestamps the purchase using its local clock as 6h:25m:42.55s, logs it, and replies OK to the client.
• Since that was the very last seat, Server X sends a message to Server Y saying "flight is full".
• Y logs "Flight PQR 123 is full" together with its own local clock value, which happens to read 6h:20m:20.21s.
• Server Z queries X's and Y's logs and is confused that a client purchased a ticket at X after the flight became full at Y.
• This may lead to further incorrect actions at Z.
Key Challenges
• End-hosts in internet-based systems (like clouds):
• Each has its own clock, unlike processors (CPUs) within one server or workstation, which share a system clock.
• Processes in internet-based systems follow an asynchronous model:
• No bounds on message delays or processing delays, unlike multi-processor (or parallel) systems, which follow a synchronous system model.
Definitions
• An asynchronous distributed system consists of a number of processes.
• Each process has a state (values of variables).
• Each process takes actions to change its state; an action may be an instruction or a communication action (send, receive).
• An event is the occurrence of an action.
• Each process has a local clock, so events within a process can be assigned timestamps and thus ordered linearly.
• But in a distributed system, we also need to know the time order of events across different processes.
Space-Time Diagram

Figure: The space-time diagram of a distributed execution, showing internal events, message send events, message receive events, and messages between processes.
Clock Skew vs. Clock Drift
• Each process (running at some end host) has its own clock. When comparing two clocks at two processes:
• Clock skew = relative difference in clock values of two processes (like the distance between two vehicles on a road).
• Clock drift = relative difference in clock frequencies (rates) of two processes (like the difference in speeds of two vehicles on the road).
• A non-zero clock skew implies the clocks are not synchronized.
• A non-zero clock drift causes the skew to increase (eventually):
• If the faster vehicle is ahead, it will drift away.
• If the faster vehicle is behind, it will catch up and then drift away.
Clock Inaccuracies
• Clocks that must not only be synchronized with each other but also adhere to physical time are termed physical clocks.
• Physical clocks are synchronized to an accurate real-time standard like UTC (Coordinated Universal Time).
• However, due to clock inaccuracy, a timer (clock) C is said to be working within its specification if

    1 − ρ ≤ dC/dt ≤ 1 + ρ

  where the constant ρ is the maximum skew rate specified by the manufacturer and dC/dt is the clock's rate with respect to real time t.

Figure: The behavior of fast, slow, and perfect clocks with respect to UTC.
How Often to Synchronize
• Maximum drift rate (MDR) of a clock:
• Absolute MDR is defined relative to Coordinated Universal Time (UTC); UTC is the "correct" time at any point in time.
• The MDR of a process depends on its environment.
• The maximum drift rate between two clocks with similar MDR is 2 * MDR.
• Given a maximum acceptable skew M between any pair of clocks, we need to synchronize at least once every M / (2 * MDR) time units, since time = distance / speed.
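A quick worked example of the M / (2 * MDR) formula, with hypothetical numbers:

```python
MDR = 20e-6            # assumed maximum drift rate: 20 parts per million
M   = 1e-3             # maximum acceptable skew between two clocks: 1 ms

# Two such clocks can drift apart at up to 2 * MDR, so resynchronize at least every:
sync_period = M / (2 * MDR)
print(sync_period)     # 25.0 seconds
```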
External vs. Internal Synchronization
• Consider a group of processes.
• External synchronization:
• Each process C(i)'s clock is within a bound D of a well-known clock S external to the group: |C(i) − S| < D at all times.
• The external clock may be connected to UTC (Coordinated Universal Time) or an atomic clock.
• Examples: Cristian's algorithm, NTP.
• Internal synchronization:
• Every pair of processes in the group have clocks within bound D: |C(i) − C(j)| < D at all times, for all processes i, j.
• Examples: Berkeley algorithm, DTP.
External vs. Internal Synchronization
• External synchronization with bound D implies internal synchronization with bound 2 * D.
• Internal synchronization does not imply external synchronization.
• In fact, the entire system may drift away from the external clock S!
Basic Fundamentals
External time synchronization: all processes P synchronize with a time server S.
• P asks S: "What's the time?"
• S checks its local clock to find the time t and replies: "Here's the time t."
• P sets its clock to t.

What's wrong:
• By the time the reply is received at P, time has moved on, so setting P's time to t is inaccurate.
• The inaccuracy is a function of message latencies.
• Since latencies are unbounded in an asynchronous system, the inaccuracy cannot be bounded.
(i) Cristian's Algorithm
• P measures the round-trip time RTT of the message exchange.
• Suppose we know the minimum P → S latency min1 and the minimum S → P latency min2.
• min1 and min2 depend on the OS overhead to buffer messages, the time TCP takes to queue messages, etc.
• The actual time at P when it receives the response carrying server time t lies in the interval [t + min2, t + RTT − min1].
(i) Cristian's Algorithm
• The actual time at P when it receives the response lies in [t + min2, t + RTT − min1].
• P sets its time to halfway through this interval: t + (RTT + min2 − min1) / 2.
• The error is then at most (RTT − min2 − min1) / 2, i.e. it is bounded by the RTT.
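A minimal sketch of one round of Cristian's algorithm, following the interval argument above. `ask_server_for_time` is a placeholder for however P actually queries S; min1 and min2 default to zero when the minimum one-way latencies are unknown.

```python
import time

def cristian_sync(ask_server_for_time, min1=0.0, min2=0.0):
    """Return (estimated current server time, maximum error)."""
    start = time.monotonic()
    t = ask_server_for_time()                  # server reads its clock during the round trip
    rtt = time.monotonic() - start
    # The true server time now lies in [t + min2, t + rtt - min1]; pick the midpoint.
    estimate = t + (rtt + min2 - min1) / 2.0
    max_error = (rtt - min1 - min2) / 2.0
    return estimate, max_error
```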
Error Bounds
• A process is allowed to increase its clock value but should never decrease it, since that may violate the ordering of events within the same process.
• A process is allowed to increase or decrease the speed of its clock.
• If the error is too high, take multiple readings and average them.
Cristian's Algorithm: Example
• Send the request at T0 = 5:08:15.100.
• Receive the response at T1 = 5:08:15.900; the response contains Tserver = 5:09:25.300.
• Elapsed time is T1 − T0 = 5:08:15.900 − 5:08:15.100 = 800 msec.
• Best guess: the timestamp was generated 400 msec ago (half the elapsed time).
• Set the time to Tserver + 400 msec = 5:09:25.300 + 400 msec = 5:09:25.700.
• If the best-case (minimum) one-way message time is Tmin = 200 msec, the timestamp was generated between 200 and 600 msec ago, so the error is at most ±200 msec.
(ii) NTP: Network Time Protocol
• (1991, 1992) Internet standard, version 3: RFC 1305.
• NTP servers are organized in a tree; each client is a leaf of the tree.
• Each node synchronizes with its tree parent.
• The hierarchy runs from primary servers at the root, through secondary and tertiary servers, down to clients at the leaves.
NTP Protocol
• A parent and child exchange a pair of messages, recording four timestamps.
• Message 1 ("let's start the protocol") is sent by the parent at ts1 (parent's clock) and received by the child at tr1 (child's clock).
• Message 2 is sent by the child at ts2 (child's clock) and received by the parent at tr2 (parent's clock).
• So ts1 and tr2 are read on the parent's clock, while tr1 and ts2 are read on the child's clock.
• From these four values, the offset o = (tr1 − tr2 + ts2 − ts1)/2 is computed, as derived next.
Why o = (tr1 − tr2 + ts2 − ts1)/2?
• Offset o = (tr1 − tr2 + ts2 − ts1)/2. Let's calculate the error.
• Suppose the real offset is oreal: the child is ahead of the parent by oreal (the parent is ahead of the child by −oreal).
• Suppose the one-way latency of Message 1 is L1 (and L2 for Message 2). No one knows L1 or L2!
• Then:
• tr1 = ts1 + L1 + oreal
• tr2 = ts2 + L2 − oreal
Why o = (tr1 − tr2 + ts2 − ts1)/2?
• Again:
• tr1 = ts1 + L1 + oreal
• tr2 = ts2 + L2 − oreal
• Subtracting the second equation from the first:
• oreal = (tr1 − tr2 + ts2 − ts1)/2 + (L2 − L1)/2
• => oreal = o + (L2 − L1)/2
• => |oreal − o| = |L2 − L1|/2 ≤ (L2 + L1)/2
• Thus the error is bounded by the round-trip time (RTT).
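The offset formula translates directly into code. A minimal sketch, using the slide's notation (ts1 and tr2 on the parent's clock, tr1 and ts2 on the child's clock):

```python
def ntp_offset(ts1, tr1, ts2, tr2):
    """Estimated amount by which the child's clock is ahead of the parent's.
    The true offset differs from this estimate by at most (L1 + L2) / 2,
    i.e. at most half the round-trip time."""
    return (tr1 - tr2 + ts2 - ts1) / 2.0

def ntp_latency_sum(ts1, tr1, ts2, tr2):
    """L1 + L2: from the two equations above, the offset terms cancel out."""
    return (tr1 - ts1) + (tr2 - ts2)
```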
(iii) Berkeley Algorithm
• Gusella & Zatti, 1989.
• The master polls each machine periodically:
• Ask each machine for its time.
• Cristian's algorithm can be used to compensate for network latency.
• When the results are in, compute the average, including the master's time.
• The hope is that the average cancels out individual clocks' tendencies to run fast or slow.
• Send to each slave the offset by which its clock needs adjustment.
• Sending the offset, rather than a time-stamp, avoids problems with network delays on the return path.
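A minimal sketch of one Berkeley round as seen by the master, assuming the readings have already been latency-compensated (e.g. with Cristian's algorithm); all values are illustrative.

```python
def berkeley_round(master_reading, slave_readings):
    """Return the offset each clock (master first, then slaves) should apply."""
    readings = [master_reading] + list(slave_readings)
    average = sum(readings) / len(readings)
    # Offsets, not absolute timestamps, are sent back, so delay on the
    # return path does not corrupt the adjustment.
    return [average - r for r in readings]

# Example (minutes past some epoch): master reads 180, slaves read 205 and 170.
print(berkeley_round(180, [205, 170]))   # [5.0, -20.0, 15.0]; the average is 185
```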
Berkeley Algorithm: Example

Figure: A worked example of the master polling the machines, averaging the readings and sending each machine its adjustment offset.
(iv) DTP: Datacenter Time Protocol
• DTP uses the physical layer of network devices to implement a decentralized clock synchronization protocol.
• Highly scalable, with bounded precision:
• ~25 ns (4 clock ticks) between peers
• ~150 ns for a datacenter with six hops
• No network traffic
• Internal clock synchronization
• End-to-end: ~200 ns precision!
DTP: Phases

Figure: The phases of DTP (INIT and BEACON), described on the following slides.
DTP: (i) INIT Phase
• The purpose of the INIT phase is to measure the one-way delay between two peers. The phase begins when two ports are physically connected and start communicating, i.e. when the link between them is established.
• Each peer measures the one-way delay by measuring the time between sending an INIT message and receiving the associated INIT-ACK message, i.e. it measures the RTT and divides it by two.
DTP: (ii) BEACON Phase
• During the BEACON phase, two ports periodically exchange their local counters for resynchronization. Due to oscillator skew, the offset between two local counters will increase over time.
• A port adjusts its local counter by selecting the maximum of the local and remote counters upon receiving a BEACON message from its peer. Since BEACON messages are exchanged frequently, hundreds of thousands of times a second (every few microseconds), the offset can be kept to a minimum.
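The two phases reduce to two very small rules. The sketch below shows them in software for clarity only; in real DTP this logic runs in the NIC/switch physical layer on counters of clock ticks.

```python
def init_one_way_delay(init_send_tick: int, init_ack_recv_tick: int) -> int:
    """INIT phase: estimate the one-way delay as half the measured RTT (in ticks)."""
    return (init_ack_recv_tick - init_send_tick) // 2

def on_beacon(local_counter: int, remote_counter: int) -> int:
    """BEACON phase (as described on the slide): keep the maximum of the local
    and remote counters, so clocks only move forward and the offset stays small."""
    return max(local_counter, remote_counter)
```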
DTP Switch

Figure: A DTP-enabled switch.
DTP Property
• DTP provides bounded precision and scalability.
• Bounded precision in hardware:
• Bounded by 4T (= 25.6 ns, where T = oscillator tick = 6.4 ns)
• Network-wide precision bounded by 4TD, where D is the network diameter in hops
• Requires NIC and switch modifications.
But Yet…
• We still have a non-zero error!
• We just can't seem to get rid of the error, and we can't as long as message latencies are non-zero.
• Can we avoid synchronizing clocks altogether, and still be able to order events?
Ordering Events in a Distributed System
• To order events across processes, synchronizing clocks is one approach.
• What if we instead assigned timestamps to events that were not absolute time?
• As long as those timestamps obey causality, that would work: if an event A causally happens before another event B, then timestamp(A) < timestamp(B).
• Humans use causality all the time. Examples:
• I enter the house only if I unlock it.
• You receive a letter only after I send it.
Logical (or Lamport) Ordering
• Proposed by Leslie Lamport in the 1970s.
• Used in almost all distributed systems since then.
• Almost all cloud computing systems use some form of logical ordering of events.
• Leslie B. Lamport (born February 7, 1941) is an American computer scientist, best known for his seminal work in distributed systems and as the initial developer of the document preparation system LaTeX. He won the 2013 Turing Award for imposing clear, well-defined coherence on the seemingly chaotic behavior of distributed computing systems, in which several autonomous computers communicate with each other by passing messages.
Lamport's Research Contributions
• Lamport's research contributions have laid the foundations of the theory of distributed systems. Among his most notable papers are:
• "Time, Clocks, and the Ordering of Events in a Distributed System", which received the PODC Influential Paper Award in 2000,
• "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", which defined the notion of sequential consistency,
• "The Byzantine Generals' Problem",
• "Distributed Snapshots: Determining Global States of a Distributed System", and
• "The Part-Time Parliament".
• These papers relate to such concepts as logical clocks (and the happened-before relationship) and Byzantine failures. They are among the most cited papers in the field of computer science and describe algorithms to solve many fundamental problems in distributed systems, including:
• the Paxos algorithm for consensus,
• the bakery algorithm for mutual exclusion of multiple threads in a computer system that require the same resources at the same time,
• the Chandy-Lamport algorithm for the determination of consistent global states (snapshots), and
• the Lamport signature, one of the prototypes of the digital signature.
Logical (or Lamport) Ordering (2)
• Define a logical relation happens-before among pairs of events, denoted →.
• Three rules:
1. On the same process: a → b if time(a) < time(b) (using the local clock).
2. If p1 sends m to p2: send(m) → receive(m).
3. (Transitivity) If a → b and b → c, then a → c.
• This creates a partial order among events: not all events are related to each other via →.
Example 1

Figure: A space-time diagram with processes P1 (events A, B, C, D, E), P2 (events E, F, G) and P3 (events H, I, J); instructions or steps are marked on each process line, with messages drawn between processes.
Example 1: Happens-Before
In the space-time diagram above:
• A → B
• B → F
• A → F
Example 2: Happens-Before
In the same diagram:
• H → G
• F → J
• H → J
• C → J
Lamport Timestamps
• Goal: assign a logical (Lamport) timestamp to each event.
• The timestamps obey causality.
• Rules:
• Each process uses a local counter (clock), which is an integer with initial value zero.
• A process increments its counter when a send or an instruction event happens at it; the counter value is assigned to the event as its timestamp.
• A send (message) event carries its timestamp.
• For a receive (message) event, the counter is updated to max(local clock, message timestamp) + 1.
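These rules fit in a few lines of code. A minimal sketch, for illustration only (not any particular system's implementation):

```python
class LamportClock:
    def __init__(self):
        self.counter = 0                       # initial value is zero

    def tick(self):
        """Instruction or send event: increment, and use as the event's timestamp."""
        self.counter += 1
        return self.counter

    def send(self):
        return self.tick()                     # the message carries this timestamp

    def receive(self, message_timestamp):
        """Receive event: counter becomes max(local clock, message timestamp) + 1."""
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter

# Matching the figure: P3 sends its first message with timestamp 1, and the
# receiving process (clock still 0) assigns max(0, 1) + 1 = 2 to the receive event.
p_sender, p_receiver = LamportClock(), LamportClock()
m = p_sender.send()
print(p_receiver.receive(m))                   # 2
```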


Example

Figure: The same space-time diagram (P1, P2, P3 with instructions/steps and messages), to which Lamport timestamps are assigned step by step on the next slides.
Lamport Timestamps: Step-by-Step Example
• Initially, all counters (clocks) are 0.
• P1's first event and P3's first event are each assigned timestamp 1; P3's send carries timestamp 1 on its message.
• P2 receives that message as its first event and sets its clock to max(local, message) + 1 = max(0, 1) + 1 = 2.
• P1's second event (a send) gets timestamp 2; when P2 receives that message it sets its clock to max(2, 2) + 1 = 3, and its next event gets 4.
• P1 then receives a message carrying timestamp 4 and sets its clock to max(3, 4) + 1 = 5; its last event gets 6.
• Final timestamps — P1: 1, 2, 3, 5, 6; P2: 2, 3, 4; P3: 1, 2, 7.
Obeying Causality
Using the Lamport timestamps from the example:
• A → B :: 1 < 2
• B → F :: 2 < 3
• A → F :: 1 < 3
Obeying Causality (2)
• H → G :: 1 < 4
• F → J :: 3 < 7
• H → J :: 1 < 7
• C → J :: 3 < 7
Not Always Implying Causality
• Does C → F hold? The timestamps are equal (3 = 3), so they tell us nothing.
• Does H → C hold? 1 < 3, yet H does not happen-before C.
• (C, F) and (H, C) are pairs of concurrent events.
Concurrent Events
• A pair of concurrent events doesn't have a causal path from one event to the other (in either direction).
• Lamport timestamps are not guaranteed to be ordered or unequal for concurrent events.
• That is OK, since concurrent events are not causally related!
• Remember:
• E1 → E2 ⇒ timestamp(E1) < timestamp(E2), BUT
• timestamp(E1) < timestamp(E2) ⇒ {E1 → E2} OR {E1 and E2 concurrent}
Vector Timestamps
• Used in key-value stores like Riak.
• Each process uses a vector of integer clocks.
• Suppose there are N processes in the group, 1…N.
• Each vector has N elements.
• Process i maintains vector Vi[1…N].
• The jth element of the vector clock at process i, Vi[j], is i's knowledge of the latest events at process j.
Assigning Vector Timestamps
Incrementing vector clocks:
1. On an instruction or send event at process i, it increments only the ith element of its vector clock.
2. Each message carries the send-event's vector timestamp Vmessage[1…N].
3. On receiving a message at process i:
   Vi[i] = Vi[i] + 1
   Vi[j] = max(Vmessage[j], Vi[j]) for j ≠ i
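A minimal sketch of these rules, together with the comparison tests used on the later slides (causally related vs. concurrent); illustrative only:

```python
class VectorClock:
    def __init__(self, i: int, n: int):
        self.i = i                              # this process's index, 0..n-1
        self.v = [0] * n                        # initial clocks are all zero

    def tick(self):
        """Instruction or send event: increment only our own element."""
        self.v[self.i] += 1
        return list(self.v)                     # a send carries a copy of this vector

    def receive(self, v_message):
        """Receive event: Vi[i] += 1, then Vi[j] = max(Vmessage[j], Vi[j]) for j != i."""
        self.v[self.i] += 1
        for j, vj in enumerate(v_message):
            if j != self.i:
                self.v[j] = max(self.v[j], vj)
        return list(self.v)

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))    # VT1 <= VT2, element-wise

def causally_before(a, b):
    return leq(a, b) and a != b                 # VT1 < VT2

def concurrent(a, b):
    return not leq(a, b) and not leq(b, a)      # VT1 ||| VT2

# From the example below: C = (3,0,0) and F = (2,2,1) are concurrent.
print(concurrent([3, 0, 0], [2, 2, 1]))         # True
```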
Example

Figure: The same space-time diagram (P1, P2, P3 with instructions/steps and messages), now used to assign vector timestamps.
Vector Timestamps: Step-by-Step Example
• Initially, all vector clocks are (0,0,0).
• P1's first event yields (1,0,0); P3's first event yields (0,0,1), and its message carries (0,0,1).
• P2 receives that message as its first event: it increments its own entry and takes the element-wise maximum with the message, giving (0,1,1).
• P1's second event (a send) yields (2,0,0); its message carries (2,0,0), and P2's receive yields (2,2,1).
• Continuing in this way, the final vector timestamps are:
• P1: (1,0,0), (2,0,0), (3,0,0), (4,3,1), (5,3,1)
• P2: (0,1,1), (2,2,1), (2,3,1)
• P3: (0,0,1), (0,0,2), (5,3,3)
Causally-Related
• VT1 = VT2, iff (if and only if) VT1[i] = VT2[i] for all i = 1, …, N.
• VT1 ≤ VT2, iff VT1[i] ≤ VT2[i] for all i = 1, …, N.
• Two events are causally related iff VT1 < VT2, i.e., iff VT1 ≤ VT2 and there exists j such that 1 ≤ j ≤ N and VT1[j] < VT2[j].
… or Not Causally-Related
• Two events VT1 and VT2 are concurrent iff NOT (VT1 ≤ VT2) AND NOT (VT2 ≤ VT1).
• We denote this as VT1 ||| VT2.
Obeying Causality
Using the vector timestamps from the example:
• A → B :: (1,0,0) < (2,0,0)
• B → F :: (2,0,0) < (2,2,1)
• A → F :: (1,0,0) < (2,2,1)
Obeying Causality (2)
• H → G :: (0,0,1) < (2,3,1)
• F → J :: (2,2,1) < (5,3,3)
• H → J :: (0,0,1) < (5,3,3)
• C → J :: (3,0,0) < (5,3,3)
Identifying Concurrent Events
• C & F :: (3,0,0) ||| (2,2,1)
• H & C :: (0,0,1) ||| (3,0,0)
• (C, F) and (H, C) are pairs of concurrent events.
Summary: Logical Timestamps
• Lamport timestamps:
• Integer clocks assigned to events.
• Obey causality.
• Cannot distinguish concurrent events.
• Vector timestamps:
• Obey causality.
• By using more space, can also identify concurrent events.
Conclusion
• Clocks are unsynchronized in an asynchronous distributed system, but we still need to order events across processes!
• Time synchronization:
• Cristian's algorithm
• Berkeley algorithm
• NTP
• DTP
• But the error is a function of the RTT.
• We can avoid time synchronization altogether by instead assigning logical timestamps to events.
Thank You!

