CS3551 Distributed Computing Unit-1
INTRODUCTION
Computation began with programs running on a single processor; such uniprocessor computing can be termed centralized computing. As the demand for processing capability grew, multiprocessor systems came into existence. The advent of multiprocessor systems led to the development of distributed systems with a high degree of scalability and resource sharing. Modern-day parallel computing is a subset of distributed computing.
In distributed systems, the entire network is viewed as a computer. The multiple systems connected to the network appear as a single system to the user; thus distributed systems hide the complexity of the underlying architecture from the user. Distributed computing is a special form of parallel computing in which the processors reside in different computers and tasks are distributed to those computers over a network.
In terms of hardware: the machines linked in a distributed system are autonomous.
In terms of software: a distributed system gives users the impression that they are dealing with a single system.
Key concerns in designing such systems include:
Performance
Reliability
Availability
Security
In centralized systems, several jobs are done on a particular central processing unit (CPU); they have shared memory and shared variables.
In distributed systems, jobs are distributed among several processors interconnected by a computer network; they have no global state, i.e., no shared memory and no shared variables.
As shown in the figure, each computer has a memory-processing unit and the computers are connected by a communication network. Each system connected to the distributed network hosts distributed software, a middleware layer, which drives the distributed system (DS) while preserving its heterogeneity. A computation or run in a distributed system is the execution of processes to achieve a common goal.
Parallel processing systems divide the program into multiple segments and process them simultaneously.
The main objective of parallel systems is to improve processing speed. They are sometimes known as multiprocessors, multicomputers, or tightly coupled systems. They refer to the simultaneous use of multiple computing resources, which can include a single computer with multiple processors, a number of computers connected by a network to form a parallel processing cluster, or a combination of both.
Characteristics of parallel systems
A parallel system may be broadly classified as belonging to one of three types:
1. A multiprocessor system
2. A multicomputer parallel system
3. Array processors
1. A multiprocessor system
A multiprocessor system is a parallel system in which the multiple processors have direct access to shared memory, which forms a common address space. The architecture is shown in the figure below.
Figure: Two standard architectures for parallel systems. (a) Uniform memory access
(UMA) multiprocessor system. (b) Non-uniform memory access (NUMA) multiprocessor.
i) Uniform Memory Access (UMA)
Here, all the processors share the physical memory in a centralized manner with equal
access time to all the memory words.
Each processor may have a private cache memory. Same rule is followed for peripheral
devices.
When all the processors have equal access to all the peripheral devices, the system is
called a symmetric multiprocessor.
When only one or a few processors can access the peripheral devices, the system is
called an asymmetric multiprocessor.
When a CPU wants to access a memory location, it checks if the bus is free, then it
sends the request to the memory interface module and waits for the requested data to be
available on the bus.
Multicore processors are small UMA multiprocessor systems, where the first shared
cache is actually the communication channel.
Figure shows two popular interconnection networks – the Omega network and the Butterfly
network, each of which is a multi-stage network formed of 2×2 switching elements. Each 2×2
switch allows data on either of the two input wires to be switched to the upper or the lower
output wire. In a single step, however, only one data unit can be sent on an output wire. So if the
data from both the input wires is to be routed to the same output wire in a single step, there is a
collision.
Figure: Interconnection networks for shared memory multiprocessor systems
Omega interconnection function
The Omega network, which connects n processors to n memory units, has (n/2) log2 n switching elements of size 2×2 arranged in log2 n stages. Between each pair of adjacent stages of the Omega network, a link exists between output i of a stage and input j of the next stage according to a perfect shuffle pattern, which is a left-rotation of the binary representation of i to obtain j. The generation function is therefore: j = 2i for 0 ≤ i ≤ n/2 − 1, and j = 2i + 1 − n for n/2 ≤ i ≤ n − 1.
The routing function from input line i to output line j considers only j and the stage number s, where s ∈ [0, log2 n − 1]. In a stage s switch, if the (s + 1)th most significant bit of j is 0, the data is routed to the upper output wire; otherwise it is routed to the lower output wire.
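The shuffle-based generation function and the MSB-based routing rule above can be illustrated with a short sketch (Python is used purely for illustration; the function and variable names are ours, not part of any standard library):

    # Sketch: Omega-network interconnection and routing for n = 2^k inputs.
    def shuffle(i, n):
        # Perfect shuffle: left-rotate the log2(n)-bit representation of i,
        # i.e. j = 2i for i < n/2 and j = 2i + 1 - n for i >= n/2.
        return 2 * i if i < n // 2 else 2 * i + 1 - n

    def omega_route(j, stages):
        # Per stage s, route on the (s+1)-th most significant bit of the
        # destination j: 0 means the upper output wire, 1 means the lower.
        return [(j >> (stages - 1 - s)) & 1 for s in range(stages)]

    n = 8                                       # 8 inputs, log2(8) = 3 stages
    print([shuffle(i, n) for i in range(n)])    # [0, 2, 4, 6, 1, 3, 5, 7]
    print(omega_route(5, 3))                    # 5 = 101b -> lower, upper, lower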
Butterfly network
A butterfly network links multiple computers into a high-speed network. For a butterfly network with n processor nodes, there need to be n(log n + 1) switching nodes. The generation of the interconnection pattern between a pair of adjacent stages depends not only on n but also on the stage number s. In a stage s switch, if the (s + 1)th most significant bit of j is 0, the data is routed to the upper output wire; otherwise it is routed to the lower output wire.
3. Array processors
Array processors are a class of processors that execute a single instruction on an entire array or table of data elements at the same time, rather than on single data elements, driven by a common clock. They are also known as vector processors. An array processor executes each instruction on all associated data items before moving on to the next instruction. Array elements are incapable of operating autonomously and must be driven by the control unit.
Flynn’s Taxonomy
Flynn's taxonomy is a specific classification of parallel computer architectures that are based on
the number of concurrent instruction (single or multiple) and data streams (single or multiple)
available in the architecture.
The four classifications in Flynn's taxonomy, based on the number of instruction streams and data streams, are the following:
1. SISD: single instruction, single data
2. MISD: multiple instruction, single data
3. SIMD: single instruction, multiple data
4. MIMD: multiple instruction, multiple data
1. SISD (Single Instruction, Single Data stream)
Single Instruction, Single Data (SISD) refers to an instruction set architecture in which a single processor (one CPU) executes exactly one instruction stream at a time.
It also fetches or stores one item of data at a time, operating on data stored in a single memory unit.
Most CPU designs are based on the von Neumann architecture and follow SISD.
The SISD model is a non-pipelined architecture with general-purpose registers, a Program Counter (PC), an Instruction Register (IR), a Memory Address Register (MAR) and a Memory Data Register (MDR).
The degree of coupling among a set of modules, whether hardware or software, is measured in
terms of the interdependency and binding and/or homogeneity
among the modules.
The multiprocessor systems are classified into two types based on coupling:
1. Loosely coupled systems
2. Tightly coupled systems
Tightly Coupled systems:
Tightly coupled multiprocessor systems contain multiple CPUs that are connected at the bus level, with both local and central shared memory.
Tightly coupled systems perform better, due to faster access to memory and faster inter-processor communication, and they are physically smaller and use less power; however, they are economically costlier.
Tightly coupled multiprocessors with UMA shared memory may be either switch-based
(e.g., NYU Ultracomputer, RP3) or bus-based (e.g., Sequent, Encore).
Some examples of tightly coupled multiprocessors with NUMA shared memory, or that communicate by message passing, include the SGI Origin 2000.
Loosely Coupled systems:
Loosely coupled multiprocessors consist of distributed memory where each processor
has its own memory and IO channels.
The processors communicate with each other via message passing or interconnection
switching.
Each processor may also run a different operating system and have its own bus control logic.
Loosely coupled systems are less costly than tightly coupled systems, but are physically
bigger and have a low performance compared to tightly coupled systems.
The individual nodes in a loosely coupled system can be easily replaced and are usually
inexpensive.
The extra hardware required to provide communication between the individual
processors makes them complex and less portable.
Loosely coupled multicomputers without shared memory are physically co-located.
These may be bus-based (e.g., NOW connected by a LAN or Myrinet card) or using a
more general communication network.
These processors neither share memory nor have a common clock.
Loosely coupled multicomputers without shared memory and without common clock
and that are physically remote, are termed as distributed systems.
Parallelism or speedup of a program on a specific system
It is the use of multiple processing elements simultaneously for solving a problem. Problems are broken down into instructions and solved concurrently, with each applied resource working at the same time.
This is a measure of the relative speedup of a specific program on a given machine. The speedup depends on the number of processors and the mapping of the program onto the processors. It is expressed as the ratio of the time T(1) with a single processor to the time T(n) with n processors.
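For example, if a program takes T(1) = 100 seconds on a single processor and T(4) = 25 seconds on four processors, the speedup is T(1)/T(4) = 100/25 = 4 (the values here are illustrative only).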
Parallelism within a parallel/distributed program
This is an aggregate measure of the percentage of time that all the processors are
executing CPU instructions productively, as opposed to waiting for communication
operations.
Concurrency
Concurrent programming refers to techniques for decomposing a task into subtasks that can execute in parallel and for managing the risks that arise when the program executes more than one task at the same time.
The parallelism or concurrency in a parallel or distributed program can be measured by the ratio of the number of local (non-communication and non-shared-memory-access) operations to the total number of operations, including the communication or shared memory access operations.
Granularity
Granularity or grain size is a measure of the amount of work or computation performed by a task.
Fig: Space time distribution of distributed systems
An internal event changes the state of the process at which it occurs. A send event
changes the state of the process that sends the message and the state of the channel on
which the message is sent.
A receive event changes the state of the process that receives the message and the state
of the channel on which the message is received.
When all the above conditions are satisfied, it can be concluded that a → b, i.e., the events a and b are causally related. Consider two events c and d: if both c → d and d → c are false, then c and d are not causally related and are said to be concurrent events, denoted c || d.
LOGICAL TIME
Logical clocks are based on capturing chronological and causal relationships of processes and
ordering events based on these relationships.
Physical Clock: A physical clock is a physical process combined with a method of measuring that process to record the passage of time. Physical clocks are based on cyclic processes such as celestial rotation.
Logical Clock: A logical clock is a mechanism for capturing chronological and causal relationships in a distributed system. A logical clock allows a global ordering of events from different processes.
A Framework for a system of logical clocks
A system of logical clocks consists of a time domain T and a logical clock C. Elements of T form a partially ordered set over a relation <. This relation is usually called the happened-before or causal precedence relation.
The logical clock C is a function that maps an event e in a distributed system to an element in the time domain T, denoted C(e), such that for any two events ei and ej, ei → ej ⇒ C(ei) < C(ej).
This monotonicity property is called the clock consistency condition. When T and C satisfy the condition ei → ej ⇔ C(ei) < C(ej), the system of clocks is said to be strongly consistent.
Data structures:
Each process pi maintains data structures with the following capabilities:
• A local logical clock (lci), that helps process pi measure its own progress.
• A global logical clock (gci), that represents process pi's local view of the logical global time. It allows this process to assign consistent timestamps to its local events.
Protocol:
The protocol ensures that a process's logical clock, and thus its view of the global time, is managed consistently with the following rules:
Rule 1: Decides the updates of the logical clock by a process. It controls send, receive and
other operations.
Rule 2: Decides how a process updates its global logical clock to update its view of the
global time and global progress. It dictates what information about the logical time is
piggybacked in a message and how this information is used by the receiving process to
update its view of the global time.
SCALAR TIME
Scalar time is designed by Lamport to synchronize all the events in distributed
systems. A Lamport logical clock is an incrementing counter maintained in each process.
This logical clock has meaning only in relation to messages moving between processes.
When a process receives a message, it resynchronizes its logical clock with that of the sender, thereby maintaining the causal relationship.
The Lamport timestamp algorithm is governed by the following rules:
All the process counters start with value 0.
A process increments its counter for each event (internal event, message sending,
message receiving) in that process.
When a process sends a message, it includes its (incremented) counter value with the
message.
On receiving a message, the counter of the recipient is updated to the greater of its
current counter and the timestamp in the received message, and then incremented by
one.
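These rules can be captured in a brief sketch (Python; the class and method names are illustrative, not part of any standard API):

    # Sketch of a Lamport (scalar) logical clock following the rules above.
    class LamportClock:
        def __init__(self):
            self.counter = 0                     # all counters start at 0

        def internal_event(self):
            self.counter += 1                    # increment on every event
            return self.counter

        def send_event(self):
            self.counter += 1                    # increment, then piggyback the value
            return self.counter                  # timestamp carried by the message

        def receive_event(self, msg_timestamp):
            # greater of own counter and the received timestamp, plus one
            self.counter = max(self.counter, msg_timestamp) + 1
            return self.counter

    p1, p2 = LamportClock(), LamportClock()
    ts = p1.send_event()            # p1.counter == 1, message carries 1
    p2.internal_event()             # p2.counter == 1
    print(p2.receive_event(ts))     # max(1, 1) + 1 == 2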
2. Total Ordering: Scalar clocks order the events in distributed systems, but two or more events at different processes may have identical timestamps. Hence a tie-breaking mechanism is essential to totally order the events. The tie-breaking is done through:
Linearly ordering the process identifiers.
A process with a lower identifier value is given higher priority.
The term (t, i) indicates the timestamp of an event, where t is its time of occurrence and i is the identity of the process at which it occurred.
The total order relation (≺) over two events x and y with timestamps (h, i) and (k, j) is given by: x ≺ y ⇔ (h < k) or (h = k and i < j).
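A minimal sketch of this tie-breaking comparison (the function name is illustrative):

    def totally_ordered_before(x, y):
        # x = (h, i), y = (k, j): x precedes y iff h < k, or h == k and i < j
        (h, i), (k, j) = x, y
        return h < k or (h == k and i < j)

    print(totally_ordered_before((3, 1), (3, 2)))   # True: equal time, lower id wins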
The vector clock of a system with N processes is a vector of N counters, one counter per process. The vector counters follow these update rules:
Initially, all counters are zero.
Each time a process experiences an event, it increments its own counter in the vector
by one.
Each time a process sends a message, it includes a copy of its own (incremented)
vector in the message.
Each time a process receives a message, it increments its own counter in the vector by
one and updates each element in its vector by taking the maximum of the value in its
own vector counter and the value in the vector in the received message.
The time domain is represented by a set of n-dimensional non-negative integer vectors in vector
time.
On the receipt of a message m with timestamp vt, the receiving process:
1. updates its global logical time by taking the component-wise maximum of its own vector and vt
2. executes R1 (increments its own component)
3. delivers the message m
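A compact sketch of these vector clock rules, together with the timestamp comparison used below (Python; the names and process indices are illustrative):

    # Sketch of vector clocks for a system of n processes.
    class VectorClock:
        def __init__(self, pid, n):
            self.pid, self.v = pid, [0] * n       # initially all counters are zero

        def tick(self):
            self.v[self.pid] += 1                 # own counter +1 on every event

        def send(self):
            self.tick()
            return list(self.v)                   # a copy of the vector travels with the message

        def receive(self, msg_vector):
            self.tick()                           # own counter +1
            self.v = [max(a, b) for a, b in zip(self.v, msg_vector)]   # component-wise max

    def happened_before(u, v):
        return all(a <= b for a, b in zip(u, v)) and u != v    # u -> v

    def concurrent(u, v):
        return not happened_before(u, v) and not happened_before(v, u)

    p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
    m = p0.send()                     # p0: [1, 0]
    p1.receive(m)                     # p1: [1, 1]
    print(happened_before(m, p1.v))   # True: the send causally precedes the receive
    print(concurrent([0, 1], [1, 0])) # True: neither timestamp dominates the other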
There is an isomorphism between the set of partially ordered events produced by a
distributed computation and their vector timestamps.
If the processes at which two events occurred are known, the test to compare their timestamps can be simplified as follows: if events x and y occurred at processes pi and pj and have timestamps vh and vk respectively, then x → y ⇔ vh[i] ≤ vk[i], and x || y ⇔ vh[i] > vk[i] and vh[j] < vk[j].
2. Strong consistency
The system of vector clocks is strongly consistent; thus, by examining the vector timestamp
of two events, we can determine if the events are causally related.
3. Event counting
If an event e has timestamp vh, vh[j] denotes the number of events executed by process pj
that causally precede e.
Message passing versus shared memory
Message passing: Processes communicate with other processes by exchanging messages. They can be protected from one another by having private address spaces.
Shared memory: A process does not have a private address space, so one process can alter the execution of another.
Efficiency: With message passing, all remote data accesses are explicit, and therefore the programmer is always aware of whether a particular operation is in-process or involves the expense of communication; with shared memory, any particular read or update may or may not involve communication by the underlying runtime support.
Blocking primitives
These primitives wait for the message to be delivered; the execution of the processes is blocked.
The sending process must wait after a send until an acknowledgement is made by the receiver.
The receiving process must wait for the expected message from the sending process.
The receipt is determined by polling a common buffer or by an interrupt.
This is a form of synchronization, i.e., synchronous communication.
A primitive is blocking if control returns to the invoking process only after the processing for the primitive completes.
Asynchronous
A Send primitive is said to be asynchronous if control returns to the invoking process as soon as the data item to be sent has been copied out of the user-specified buffer.
It does not make sense to define an asynchronous Receive primitive.
Implementing non-blocking operations is tricky.
For non-blocking primitives, a return parameter on the primitive call returns a system-generated handle which can later be used to check the status of completion of the call.
The process can check for completion:
o by checking if the handle has been flagged or posted, or
o by issuing a Wait with a list of handles as parameters; the Wait usually blocks until one of the parameter handles is posted.
Fig a) Blocking synchronous send and blocking receive Fig b) Non-blocking synchronous send and
blocking receive
Non-blocking Receive:
The Receive call will cause the kernel to register the call and return the handle of
a location that the user process can later check for the completion of the non-
blocking Receive operation.
This location gets posted by the kernel after the expected data arrives and is
copied to the user-specified buffer. The user process can check for the completion
of the non-blocking Receive by invoking the Wait operation on the returned
handle.
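The handle-and-Wait pattern described above can be mimicked in ordinary Python with a future object standing in for the kernel-posted handle (this is only an analogy for the mechanism, not a real message-passing API; the names are ours):

    # Sketch: a non-blocking Send that returns a handle, plus a Wait on it.
    import time
    from concurrent.futures import ThreadPoolExecutor, wait

    _pool = ThreadPoolExecutor(max_workers=2)

    def transmit(data):
        time.sleep(0.1)                 # stands in for the kernel completing the send
        return len(data)

    def nonblocking_send(data):
        handle = _pool.submit(transmit, data)   # control returns immediately
        return handle                           # later used to check completion

    handle = nonblocking_send(b"hello")
    # ... do other useful work while the send is in progress ...
    done, _ = wait([handle])            # Wait blocks until the handle is posted
    print(handle.result())              # 5: the send has completed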
Processor Synchrony
Processor synchrony indicates that all the processors execute in lock-step with their clocks
synchronized.
Since distributed systems do not follow a common clock, this abstraction is implemented using
some form of barrier synchronization to ensure that no processor begins executing the next step
of code until all the processors have completed executing the previous steps of code assigned to
each of the processors.
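The barrier idea can be sketched with Python's threading.Barrier (an illustration of lock-step rounds on one machine, not a distributed implementation):

    # Sketch: no worker starts step r+1 until every worker has finished step r.
    import threading

    N = 4
    barrier = threading.Barrier(N)

    def worker(pid):
        for step in range(3):
            # ... execute the code assigned to this processor for this step ...
            barrier.wait()              # block until all N workers reach this point
            print(f"process {pid} completed step {step}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads: t.start()
    for t in threads: t.join()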
Libraries and standards
There exists a wide range of primitives for message passing. The Message Passing Interface (MPI) library and the Parallel Virtual Machine (PVM) library are used largely by the scientific community.
Message Passing Interface (MPI): This is a standardized and portable message-passing system designed to function on a wide variety of parallel computers. MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process. The primary goal of MPI is to provide a widely used standard for writing message-passing programs (a small illustrative sketch is given after this list of libraries).
Parallel Virtual Machine (PVM): It is a software tool for parallel networking of
computers. It is designed to allow a network of heterogeneous Unix and/or Windows
machines to be used as a single distributed parallel processor.
Remote Procedure Call (RPC): RPC is a common model of request-reply protocol and a powerful technique for constructing distributed, client-server based applications. In RPC, the called procedure need not exist in the same address space as the calling procedure: the two processes may be on the same system, or on different systems with a network connecting them. By using RPC, programmers of distributed applications avoid the details of the interface with the network; RPC thus makes the client/server model of computing more powerful and easier to program.
Remote Method Invocation (RMI): RMI (Remote Method Invocation) is a way that a
programmer can write object-oriented programming in which objects on different
computers can interact in a distributed network. It is a set of protocols being developed
by Sun's JavaSoft division that enables Java objects to communicate remotely with other
Java objects.
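As a small illustration of the cooperative send/receive style described under MPI, the following sketch assumes the mpi4py Python binding is installed and the program is launched under an MPI runtime (for example, mpiexec -n 2 python demo.py; the data and tag values are arbitrary):

    # Sketch: moving data from the address space of process 0 to process 1.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()              # this process's identifier

    if rank == 0:
        data = {"round": 1, "payload": [1, 2, 3]}
        comm.send(data, dest=1, tag=11)         # data leaves process 0's address space
    elif rank == 1:
        data = comm.recv(source=0, tag=11)      # and arrives in process 1's
        print("process 1 received", data)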
A system that supports the causal ordering model satisfies the following property: for any two messages m1 and m2 sent to the same destination process, if send(m1) → send(m2), then the delivery of m1 at that destination precedes the delivery of m2.
GLOBAL STATE
A distributed snapshot represents a state in which the distributed system might have been. A snapshot of the system is a single configuration of the system.
The state of a process at any time is defined by the contents of processor registers,
stacks, local memory, etc. and depends on the local context of the distributed
application.
The global state of a distributed system is a collection of the local states of its
components, namely, the processes and the communication channels.
The state of a channel is given by the set of messages in transit in the channel.
The state of a channel is difficult to state formally because a channel is a distributed entity and its state depends upon the states of the processes it connects. Let SCij denote the state of a channel Cij; it is defined as the set of messages sent along Cij whose send events are recorded in the state of the sender pi but whose receive events are not recorded in the state of the receiver pj, i.e., the messages still in transit on Cij.
A distributed snapshot should reflect a consistent state. A global state is consistent if it could
have been observed by an external observer. For a successful Global State, all states must be
consistent:
If we have recorded that a process P has received a message from a process Q, then we should also have recorded that process Q actually sent that message.
Otherwise, a snapshot will contain the recording of messages that have been received
but never sent.
The reverse condition (Q has sent a message that P has not received) is allowed.
The notion of a global state can be graphically represented by what is called a cut. A cut
represents the last event that has been recorded for each process.
The history of each process pi is given by the sequence of events that occur at it: hi = ⟨ei^1, ei^2, ...⟩. Each event is either an internal action of the process or the sending or receiving of a message over a channel. We denote by si^k the state of process pi immediately before the kth event occurs. The state si in the global state S corresponding to the cut C is that of pi immediately after the last event processed by pi in the cut, ei^ci. The set of events {ei^ci, 1 ≤ i ≤ n} is called the frontier of the cut.
Pictorially, a cut is a line that slices the space–time diagram, and thus the set of events in the distributed computation, into a PAST and a FUTURE. The PAST contains all the events to the left of the cut and the FUTURE contains all the events to the right of the cut. For a cut C, let PAST(C) and FUTURE(C) denote the set of events in the PAST and FUTURE of C, respectively.
Consistent cut: A consistent global state corresponds to a cut in which every message
received in the PAST of the cut was sent in the PAST of that cut.
Inconsistent cut: A cut is inconsistent if a message crosses the cut from the FUTURE to the
PAST.
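The consistency condition can be checked mechanically once each message's send and receive events are known and the cut is expressed as the last recorded event index on each process. A small sketch (the event numbering and message records are made up for illustration):

    # Sketch: a cut is consistent if no message is received in the PAST
    # while having been sent in the FUTURE.
    # cut[p] = index of the last event of process p included in the cut.
    # Each message is ((send_proc, send_index), (recv_proc, recv_index)).
    def is_consistent(cut, messages):
        for (sp, si), (rp, ri) in messages:
            received_in_past = ri <= cut[rp]
            sent_in_past = si <= cut[sp]
            if received_in_past and not sent_in_past:
                return False        # message crosses the cut from FUTURE to PAST
        return True

    msgs = [((0, 2), (1, 3))]       # m: sent as event 2 of p0, received as event 3 of p1
    print(is_consistent({0: 2, 1: 3}, msgs))   # True: send and receive both in the PAST
    print(is_consistent({0: 1, 1: 3}, msgs))   # False: received in PAST, sent in FUTURE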
PAST AND FUTURE CONES OF AN EVENT
In a distributed computation, an event ej could have been affected only by events ei such that ei → ej, and all the information available at ei could be made accessible at ej. In other words, ei and ej must have a causal relationship. Let Past(ej) denote all the events in the past of ej in any computation.
The term max(Pasti(ej)) denotes the latest event of process pi that has affected ej. This will always be a message send event.
A cut in a space–time diagram is a line joining an arbitrary point on each process line that slices the space–time diagram into a PAST and a FUTURE. A consistent global state corresponds to a cut in which every message received in the PAST of the cut was sent in the PAST of that cut.
The future of an event ej, denoted by Future(ej), contains all the events ei that are causally affected by ej.
Futurei(ej) is the set of those events of Future(ej) that are on process pi, and min(Futurei(ej)) is the first event on process pi that is affected by ej. All events at a process pi that occurred after max(Pasti(ej)) but before min(Futurei(ej)) are concurrent with ej.
MODELS OF PROCESS COMMUNICATIONS
There are two basic models of process communications
Synchronous: The sender process blocks until the message has been received by the
receiver process. The sender process resumes executiononly after it learns that the receiver
process has accepted the message. The sender and the receiver processes must synchronize
to exchange a message.
Asynchronous: It is non- blocking communication where the sender and the receiver do not
synchronize to exchange a message. The sender process does not wait for the message to be
delivered to the receiver process. The message is buffered by the system and is delivered to
the receiver process when it is ready to accept the message. A buffer overflow may occur if a process sends a large number of messages in a burst to another process.
Centralized systems do not need clock synchronization, as they work under a common clock. But distributed systems do not follow a common clock: each system functions based on its own internal clock and its own notion of time. The time in distributed systems is measured in the following contexts:
The time of the day at which an event happened on a specific machine in the network.
The time interval between two events that happened on different machines in the
network.
The relative ordering of events that happened on different machines in the network.
Clock synchronization is the process of ensuring that physically distributed processors have a
common notion of time.
Due to differing clock rates, the clocks at various sites may diverge with time, and clock synchronization must be performed periodically to correct this clock skew in distributed systems. Clocks are synchronized to an accurate real-time standard like UTC (Universal Coordinated Time). Clocks that must not only be synchronized with each other but must also adhere to physical time are termed physical clocks. This degree of synchronization additionally enables the coordination and scheduling of actions between multiple computers connected to a common network.
Basic terminologies:
If Ca and Cb are two different clocks, then:
Time: The time of a clock in a machine p is given by the function Cp(t), where Cp(t) = t for a perfect clock.
Frequency: Frequency is the rate at which a clock progresses. The frequency at time t of clock Ca is C′a(t).
Offset: Clock offset is the difference between the time reported by a clock and the real time. The offset of the clock Ca is given by Ca(t) − t. The offset of clock Ca relative to Cb at time t ≥ 0 is given by Ca(t) − Cb(t).
Skew: The skew of a clock is the difference in the frequencies of the clock and a perfect clock. The skew of clock Ca relative to clock Cb at time t is C′a(t) − C′b(t).
Drift (rate): The drift of clock Ca is the second derivative of the clock value with respect to time, namely C″a(t). The drift of clock Ca relative to clock Cb at time t is C″a(t) − C″b(t).
Clocking Inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC (Universal Coordinated Time). Due to the clock inaccuracies discussed above, a timer (clock) is said to be working within its specification if its rate of drift is bounded by a maximum drift rate ρ, i.e., 1 − ρ ≤ dC/dt ≤ 1 + ρ.
Let T1, T2, T3, T4 be the values of the four most recent timestamps. Assume that clocks A and B are stable and running at the same speed. Let a = T1 − T3 and b = T2 − T4. If the network delay difference from A to B and from B to A, called the differential delay, is small, the clock offset θ and round-trip delay δ of B relative to A at time T4 are approximately given by θ = (a + b)/2 and δ = a − b.
Each NTP message includes the latest three timestamps T1, T2, and T3, while T4 is determined upon arrival.
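The two estimates follow directly from the four timestamps; a small sketch with made-up values (implementing exactly the formulas stated above):

    # Sketch: offset and round-trip delay estimation from four timestamps.
    def estimate(T1, T2, T3, T4):
        a = T1 - T3
        b = T2 - T4
        theta = (a + b) / 2         # clock offset of B relative to A
        delta = a - b               # round-trip delay
        return theta, delta

    print(estimate(10.002, 10.005, 10.000, 10.006))   # roughly (0.0005, 0.003)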
Asynchronous Execution:
A communication among processes is considered asynchronous when every communicating process can have a different observation of the order of the messages being exchanged. In an asynchronous execution:
there is no processor synchrony and there is no bound on the drift rate of processor clocks,
message delays are finite but unbounded, and
there is no upper bound on the time taken by a process to execute a step.
Synchronous Execution:
A communication among processes is considered synchronous when every process observes the same order of messages within the system. In the same manner, the execution is considered synchronous when every individual process in the system observes the same total order of all the events that happen within it. In a synchronous execution:
processors are synchronized and the clock drift rate between any two processors is
bounded
message delivery times are such that they occur in one logical step or round
upper bound on the time taken by a process to execute a step.
Fig: Synchronous execution
Emulating an asynchronous system by a synchronous system (A → S)
An asynchronous program can be emulated on a synchronous system fairly trivially as
the synchronous system is a special case of an asynchronous system – all communication
finishes within the same round in which it is initiated.
Emulating a synchronous system by an asynchronous system (S → A)
A synchronous program can be emulated on an asynchronous system using a tool called a synchronizer.
Real-time scheduling becomes more challenging when a global view of the system state is absent and on-line or dynamic changes are more frequent. The message propagation delays, which are network-dependent, are hard to control or predict. This is a hindrance to meeting the QoS requirements of the network.
Performance
User-perceived latency in distributed systems must be reduced. Common issues in performance include:
Metrics: Appropriate metrics must be defined for measuring the performance of theoretical distributed algorithms and their implementations.
Measurement methods/tools: As the distributed system is a complex entity, appropriate methodology and tools must be developed for measuring the performance metrics.