Distributed System Notes
needed.
Hence, always think of having ways to handle such situations with the help of multiple computers. When multiple computers are involved in solving a single problem, communication between them, the performance of each system, how the failure of one system affects the remaining parts of the solution, how communication failures affect system availability, and more come into the picture.
This is just like a cricket team selection. 11 members are needed to play the game. There is a backup for each player; if one player performs badly consistently, a new high-performing player takes his place, and so on.
Each player has a designated role, e.g., wicket keeper. A bowler may not know wicket keeping. This separation of concerns is the first step:
1. Infrastructure
2. Application.
When we use distributed systems to solve problems that require the power of a distributed system, or problems that are inherently distributed, we first need the basic synchronization primitives from single-machine concurrency:
Mutex - Mutual Exclusion. A thread has to acquire a lock to access a shared resource, and other threads have to wait until the lock is released by the thread that acquired it. This ensures only one thread can access a critical section of code/resource at a time.
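Example ( a minimal sketch using Python's threading.Lock; the counter and the ten threads are just illustrative ):

    import threading

    counter = 0                      # shared resource
    lock = threading.Lock()          # the mutex

    def increment():
        global counter
        with lock:                   # acquire; other threads block here
            counter += 1             # critical section: one thread at a time
                                     # lock is released automatically on exit

    threads = [threading.Thread(target=increment) for _ in range(10)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                   # always 10, thanks to mutual exclusion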
Semaphore - involves a counter that decides the maximum number of simultaneous accesses to the shared resource ( e.g., device connections ). When a thread accesses the shared resource, the counter is checked to see whether the maximum has been reached, and updated if not; a thread that finds the maximum reached has to wait. When a thread releases the resource, the counter is updated back, letting a waiting thread in.
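Example ( a minimal sketch using Python's threading.Semaphore, whose counter tracks the remaining permits; the connection-pool framing is illustrative ):

    import threading, time

    MAX_CONNECTIONS = 3
    sem = threading.Semaphore(MAX_CONNECTIONS)   # permit counter starts at 3

    def use_device(i):
        with sem:                    # acquire: counter -= 1, blocks at 0
            print(f"thread {i} connected")
            time.sleep(0.1)          # simulate using the device
                                     # release on exit: counter += 1

    threads = [threading.Thread(target=use_device, args=(i,)) for i in range(10)]
    for t in threads: t.start()
    for t in threads: t.join()

At most MAX_CONNECTIONS threads can be inside the with-block at any moment.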
Condition variable - A condition variable allows threads to wait for a specific condition to be true. It's like waiting for a signal. A condition variable always needs to be associated with a lock. It adds extra features to the lock ( a mutex ): the ability to wait, to signal, and to get signalled when a condition changes. For example, a shared buffer with a producer and a consumer.
Example:
    with condition:              # acquires the associated lock
        while not ready:         # re-check in a loop, guarding against spurious wakeups
            condition.wait()     # releases the lock while waiting
        # condition met, proceed with the operation
        condition.notify()       # wake up another waiting thread
----------------------------
more details:
=============
lock = threading.Lock():
This creates a basic mutual exclusion lock. Any thread that wants to enter a critical section (access shared data) has to acquire the lock before proceeding. If another thread already holds the lock, the thread has to wait until the lock is released.
condition = threading.Condition(lock):
This creates a condition variable that is associated with the same lock (lock).
A condition variable allows threads to wait for a specific condition to be true (for example, a buffer not being empty). While waiting, the thread releases the lock temporarily (internally), allowing other threads to acquire the lock and potentially change the state of the shared resource (e.g., adding an item to the buffer).
When the condition is met, another thread can call condition.notify() or condition.notify_all() to wake up the waiting threads.
Key difference:
Unlike a plain lock, which just ensures mutual exclusion, a condition variable provides a way to coordinate threads. The lock ensures safety for shared data, while the condition allows threads to signal each other about state changes.
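Putting it together, a minimal producer/consumer sketch with the lock + condition pair described above ( the buffer and the item values are illustrative ):

    import threading, collections

    buffer = collections.deque()
    lock = threading.Lock()
    condition = threading.Condition(lock)

    def producer():
        for item in range(5):
            with condition:              # acquires the lock
                buffer.append(item)      # change shared state
                condition.notify()       # signal a waiting consumer

    def consumer():
        for _ in range(5):
            with condition:
                while not buffer:        # wait until the buffer is non-empty
                    condition.wait()     # releases the lock while blocked
                print(buffer.popleft())

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t2.start(); t1.start()
    t1.join(); t2.join()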
To get the maximum out of the available compute power, developers need to use techniques such as multiprocess or multithreaded applications. This improves application performance. In multithreaded applications, shared resources are managed using thread synchronization techniques.
However, improper usage of thread synchronization techniques can lead to problems such as race conditions, deadlocks, etc. Hence the developer has to keep this in mind and use thread synchronization techniques properly.
To make client-server communication feel as if the server application is running on the same machine as the client, RPCs were introduced.
RPC abstracts the details of the network communication from the developer. The
syntax and behavior of the function call appear identical to a local procedure
call. The underlying system handles the network communication (such as marshaling
the data, sending it over the network, and waiting for a response).
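A minimal sketch of that idea ( CalculatorStub and its add method are hypothetical, not a real RPC library; a real framework generates such stubs for you ):

    import json, socket

    class CalculatorStub:
        """Client-side stub: looks like a local object, but each call
        is marshaled and shipped to the server over the network."""
        def __init__(self, host, port):
            self.addr = (host, port)

        def add(self, a, b):                       # looks like a local call
            request = json.dumps({"method": "add", "params": [a, b]})
            with socket.create_connection(self.addr) as s:
                s.sendall(request.encode())        # marshal + send
                reply = s.recv(4096)               # wait for the response
            return json.loads(reply.decode())["result"]  # unmarshal

    # calc = CalculatorStub("server.example", 9000)   # hypothetical server
    # calc.add(2, 3)   # feels local, actually a remote call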
Client-Server Model:
Marshaling: Converting the function parameters and arguments into a format that can
be transmitted over a network (serializing the data).
Unmarshaling: Converting the received data back into the original format for
execution on the server.
This allows the function parameters to be transmitted between different systems,
even if they have different architectures or data formats.
Synchronous/Asynchronous Communication:
An RPC call can be synchronous (the client blocks until the response arrives) or asynchronous (the client continues and handles the response later).
Error Handling:
Since RPC involves communication between two systems over a network, it must handle additional errors such as network failures, server unavailability, timeouts, and data corruption.
XML-RPC:
An RPC protocol that uses XML to encode its calls and HTTP as the transport protocol. It is simple but lacks some of the advanced features found in modern RPC frameworks.
JSON-RPC:
Similar to XML-RPC, but it uses JSON for encoding calls instead of XML.
Easier to work with compared to XML-RPC, due to JSON's simplicity and popularity.
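For example, a JSON-RPC 2.0 request/response pair looks like this ( the subtract example follows the spec's own format ):

    --> {"jsonrpc": "2.0", "method": "subtract", "params": [42, 23], "id": 1}
    <-- {"jsonrpc": "2.0", "result": 19, "id": 1}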
CORBA (Common Object Request Broker Architecture):
A standard defined by the Object Management Group (OMG) that allows programs to
communicate with one another regardless of where they are located or who has
created them.
It is an older technology and less commonly used in modern applications.
Thrift:
An RPC framework originally developed at Facebook and now maintained as Apache Thrift. It uses an interface definition language (IDL) to generate client/server code in many languages and supports efficient binary protocols.
------------------------------------------------------------------------------------
The same applies to data. When you have the power of multiple computers, you split the data among them. This is called sharding.
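A minimal sketch of hash-based sharding ( the four servers and the key format are illustrative ):

    import hashlib

    NUM_SHARDS = 4
    servers = [f"server-{i}" for i in range(NUM_SHARDS)]   # hypothetical names

    def shard_for(key: str) -> str:
        # stable hash, so the same key always maps to the same server
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return servers[h % NUM_SHARDS]

    print(shard_for("user:1001"))   # e.g. server-2
    print(shard_for("user:1002"))   # keys spread across the 4 servers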
Now the challenges: hundreds of servers handling data means faults can occur ( any device can go down, for example ). Having human involvement every time a fault occurs is not feasible. Hence we need automatic fault tolerance. One of the ways to be tolerant to failures is by keeping replicas.
Replication can cause data inconsistencies. To have consistency, you need extra mechanisms, and those reduce performance.
Intention: Harness the power of hundreds of computers in processing the data and thereby increase performance.
Performance ==> Done by sharding the data and spreading it among compute devices.
Fault tolerance ==> Achieved via replication ( multiple copies of the same data ).
( This implies strong data consistency means low performance, while weak consistency means better performance. People prefer weak consistency when possible.
Strong consistency means the behaviour is as consistent as the situation where your data resides on a single machine, with one copy of the data, doing one thing at a time.
Emulating that behavior in a distributed system with hundreds of servers is a tedious task and lowers performance. )
------------------------------------------------------------------------------------
Primary-backup replication
==========================
Fail-stop, in terms of fault tolerance, means that if something goes wrong with the computer, the computer stops executing. It doesn't compute incorrect results ( leading to incorrect state ). How to simulate? E.g., a power failure on a single computer.
A fail-stop situation can be dealt with by replication. That fault tolerance is expected to handle fail-stop errors.
Replication cannot solve faults due to software/hardware bugs. Also, errors occurring to packets during transit, or disk errors during storage, have to be dealt with by error-correcting codes such as CRC.
Replication is expected to deal with failures that can cause one data center to go down, like natural disasters.
Replication schemes:
1. State transfer: the primary sends a copy of its entire state, i.e. the contents of its RAM, to the backup. This way both primary and backup remain in sync. A large data transfer is required between primary and backup ( for example, gigabytes in size ). A more efficient approach is to transfer only the parts that changed since the last backup.
2. Replicated state machines: these don't send the entire state between primary and backups. Instead, the primary only sends the external events, such as input arriving from the outside world. The assumption is that if two or more machines are in the same state and they all get the same input at the same time, they all continue to be in the same state.
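A toy sketch contrasting the two schemes ( the Replica class and its ops are illustrative ):

    class Replica:
        def __init__(self):
            self.state = {}                 # stands in for RAM contents

        def apply(self, op):                # ops must be deterministic
            key, value = op
            self.state[key] = value

    primary, backup = Replica(), Replica()

    # 1. state transfer: ship the whole state (or a diff) to the backup
    primary.apply(("x", 1))
    backup.state = dict(primary.state)      # large copy

    # 2. replicated state machine: ship only the external inputs (ops)
    op = ("y", 2)
    primary.apply(op)
    backup.apply(op)                        # same ops, same order => same state
    assert primary.state == backup.state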
In a nutshell: state transfer ships the memory itself, while a replicated state machine ships only the events that change it.
What if primary and backup are not the same due to some failures?
Consider GFS, where the master controls who is the primary based on a lease duration. When the lease expires, the primary at that point in time informs the master that its lease has expired and it is no longer the primary. The master then allocates the lease to another chunk server, which becomes the primary. Assume that each chunk server is replicated for fault tolerance.
Now consider the scenario where the primary replica of the primary GFS chunk server [ the primary GFS chunk server is decided by the GFS master, while the primary among a chunk server's replicas is decided within the chunk replication ] sends a notification to the GFS master that its lease expired and it is no longer the primary chunk server. After that it fails, and a secondary replica takes over as the primary of that chunk server. The problem is that whatever external communication just happened is not reflected in the secondary. Hence, when it takes charge, it doesn't know about the lease expiry and thinks that it is still the primary for GFS. By this time the GFS master would have allocated primary status to another chunk server, leading to two GFS primaries. Replication fault tolerance should be designed to handle these edge cases.
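A sketch of the lease idea ( the class and field names are illustrative; the real GFS protocol is more involved ):

    import time

    class ChunkServer:
        def __init__(self):
            self.lease_expiry = 0.0

        def grant_lease(self, seconds=60.0):   # done by the GFS master
            self.lease_expiry = time.time() + seconds

        def is_primary(self):
            # re-check the clock before every privileged action; acting on
            # a stale belief is what creates the two-primaries edge case
            return time.time() < self.lease_expiry

    server = ChunkServer()
    server.grant_lease(0.01)
    time.sleep(0.02)
    print(server.is_primary())   # False: must stop acting as primary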
You also have to understand at what level the replication is happening. For example, in GFS replication only happens for the chunk data. It doesn't copy the state of the machine on which the chunk was running. That is, it doesn't care about at what point an external input is handled in the primary versus the backup.
But other systems, like VMware virtualization, do care about this deeper level when considering the backup.
VMware Fault Tolerance ( deep backup - replicated state machine approach )
---------------------------------------------------------------------------
Even the memory and the registers have to be the same.
Scenario: Two VMs running some app. Both share a disk from a disk server cluster available via the network.
When a client sends an input to the primary, the hypervisor gets it and sends a copy of it to the backup VM as well. Then the primary delivers it to the app in the VM via the guest OS in that VM and subsequently returns the result to the client. The backup goes through the same sequence of events, but its hypervisor knows that it is the backup and drops the response to the client.
The messages exchanged between the primary VM and the backup VM are called LOG ENTRIES, and they travel over the LOG CHANNEL: a logical connection between the primary VM and the backup VM through which the external events are communicated to the backup.
What faults can occur? The primary can fail, or the backup can fail; the interesting case is when the primary fails.
This means the backup is not getting anything on the LOG CHANNEL. The only thing we can infer when the backup is not seeing any "keep alive" on the LOG CHANNEL is that the primary is not available. It can't say that the primary has crashed; it can only infer that the primary is unavailable. In this case, the backup takes over the role of primary. This happens by the hypervisor of the backup machine letting the backup act freely, without having to wait for entries from the log channel. The backup has to act as if it is the only server handling the client requests. The same applies if the backup fails: the primary has to act in a similar way.
Non-deterministic events that can happen in a replicated state machine ( RSMs are deterministic, but, for example, an input request from a client arrives whenever it arrives )
-----------------------------------------------------------------------
INPUT - the only mode of communication between client and server is via the network. Therefore, when you say an input arrived, you should think of it as packets arriving.
A mental model for the primary and secondary RSMs: both are busy running some infinite stream of assembly instructions like LDA 5000, STA 6000, etc. While these machines are busy with these operations, an interrupt occurs. They will process the interrupt and then continue. But which instruction these machines were executing when the interrupt occurred can affect the next state. For both to stay synchronized, they both have to handle it at exactly the same instruction point.
In addition, there are a few instructions, like rand(), that behave differently on different machines.
Multi-core parallelism can also affect an RSM. For now, the abstraction the hypervisor provides to the guest OS is that of a single-core machine.
The primary VM's physical server sends a clock interrupt every 100 seconds ( as an example ) to the hypervisor. The hypervisor delivers that interrupt to the guest OS, records the instruction number at which the interrupt is processed, and then sends all these details through the LOG CHANNEL to the secondary. The secondary only processes this timer event, and ignores the timer interrupts from its own physical machine.
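A sketch of what travels on the log channel ( the field names are illustrative, following the idea of tagging each event with the exact instruction point ):

    from dataclasses import dataclass

    @dataclass
    class LogEntry:
        instruction_number: int   # exact point in the guest's instruction
                                  # stream where the event must be delivered
        event_type: str           # e.g. "timer_interrupt", "network_packet"
        payload: bytes            # the packet data, if any

    # primary's hypervisor, on a timer interrupt at instruction 1_000_000:
    entry = LogEntry(1_000_000, "timer_interrupt", b"")
    # ...sent over the log channel; the backup replays it at that exact point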
How do we ensure that the backup is not getting ahead of the primary by the time an interrupt occurs?
The VMware hypervisor keeps a buffer for storing the interrupts coming in on the log channel. As long as there is at least one interrupt waiting in that buffer, the secondary will not free-run past it in its normal course of execution; it executes only up to the point recorded in that entry. This way the secondary is always catching up with the primary, never getting ahead.
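A sketch of that rule ( the tuple entries stand in for real log entries ):

    import collections

    log_buffer = collections.deque()   # entries received on the log channel

    def backup_can_execute():
        # the backup executes guest instructions only while it has a buffered
        # log entry to aim for; with an empty buffer it pauses, so it can
        # never run ahead of the primary
        return len(log_buffer) > 0

    log_buffer.append(("timer_interrupt", 1_000_000))
    print(backup_can_execute())        # True: run up to the entry's point
    log_buffer.popleft()
    print(backup_can_execute())        # False: wait for the primary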
Imagine a packet arrives: if we directly let the guest OS read that info from the host machine's buffer, then there is no way to let the secondary know at what instruction point the interrupt occurred. Hence it has to go via the hypervisor only.
Also, assume 99.99 percent of the traffic on the LOG CHANNEL is generated from the incoming packet stream, and not from random instructions.
For example, if the primary responded to the client but then crashed, and the LOG CHANNEL message was also not delivered to the secondary due to some network issue, there is a replication failure leading to inconsistent data.
You have to implement a solution. VMware Fault Tolerance uses a technique called the Output Rule.
This technique means that you send the response to the client only after you get an acknowledgement from the secondary that it has received the corresponding log entry.
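A sketch of the output rule ( the transport and the ack queue are stand-ins; a real hypervisor does this at the packet level ):

    import queue

    ack_queue = queue.Queue()                 # acks arriving from the backup

    def send_to_backup(entry):                # hypothetical transport stub
        print("log entry shipped:", entry)
        ack_queue.put(entry)                  # pretend the backup acked

    def handle_client_request(request):
        entry = ("input", request)            # wrap the input as a log entry
        send_to_backup(entry)                 # ship it on the log channel
        result = request.upper()              # stand-in for executing the request
        ack_queue.get()                       # OUTPUT RULE: block until the
                                              # backup acknowledges the entry
        return result                         # only now release the output

    print(handle_client_request("hello"))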