Distributed System Notes
========================

Data can be too large for one machine to handle while delivering the performance
needed. Hence, always think of ways to handle such situations with the help of
multiple computers. When multiple computers are involved in solving a single
problem, new concerns come into the picture: communication between them, the
performance of each system, how the failure of one system affects the remaining
parts of the solution, how a communication failure affects the system's
availability, and more.

This is just like cricket team selection. 11 members are needed to play the game,
and a backup exists for each player. If one player consistently performs badly, a
new high-performing player takes his place.

Each player has a designated role, e.g. wicket keeper. A bowler may not know wicket
keeping. This separation of concerns is the first step.

Eg: compute, storage, communication - on the infrastructure side.

Front end, middleware, backend, load balancer, ingress controller etc. - on the
application side.

So we can think of two aspects of a distributed system:

1. Infrastructure
2. Application

But there are common principles in both cases.

When we use distributed systems to solve problems that require their power, or
problems that are inherently distributed, the infrastructure side gives us multiple
servers (compute) available to run the load. So when you develop your applications,
you have to think of leveraging this capability offered by the hardware. This can be
achieved with threaded programming for concurrency, or by using multiple processes.
Weigh the options and use them wisely in application development.
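For instance, here is a minimal Python sketch (names illustrative) contrasting the
two options. In CPython, threads suit I/O-bound work since the GIL serializes
CPU-bound threads, while processes give true CPU parallelism:

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def busy_sum(n):
        return sum(range(n))     # stand-in for real CPU-bound work

    if __name__ == "__main__":
        work = [10_000_000] * 4

        # Threads: one process, shared memory; best for I/O-bound tasks.
        with ThreadPoolExecutor(max_workers=4) as pool:
            thread_results = list(pool.map(busy_sum, work))

        # Processes: separate interpreters; true parallelism for CPU-bound tasks.
        with ProcessPoolExecutor(max_workers=4) as pool:
            process_results = list(pool.map(busy_sum, work))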

Modularize the program as isolated components (separation of concerns) and enable
communication between them - the fundamentals of microservice design.

Now two important aspects. First, communication between the different pieces of the
solution - RPC. Second, concurrency: while concurrency improves performance, it can
lead to inconsistencies as well. (In the real world one thread doesn't know about
the existence of another; it is the developer who, with the help of OS thread
scheduling constructs such as mutexes, semaphores, and condition variables, manages
their co-existence in a way that doesn't break the consistency of the program and
its data.) For example, when two threads try to manipulate the same data, the
result can be inconsistent depending on the order in which the threads access it.
The developer has to be mindful of this in his logic.

Effective utilization of compute resources comes with related concerns: controlling
the number of connections to the application/customer devices (the maximum number of
devices that can be accessed simultaneously from your application, so that
parallelism using threads doesn't break the solution), and conditional waiting for
responses in a thread (like waiting for a response from a device - the thread can be
moved to the BLOCKED [aka WAITING] state so that other threads can be scheduled for
CPU resources, i.e. made RUNNING). Along with these techniques to get the maximum
out of the compute, the problems associated with them also come.
1. Race condition - two threads competing for access to a shared resource (maybe a
piece of code, a hardware resource, etc.) - the order of execution of the threads
can impact the result.
- Solution: the developer has to think about these situations and apply the
techniques offered by the operating system and the programming language to avoid
them. The constructs the programmer uses in his code when dealing with race
conditions are the mutex, the semaphore, and the condition variable. These
techniques are called thread synchronization.

All of these are OS-supported features.

Mutex - Mutual Exclusion. A thread has to acquire a lock to access a shared
resource, and other threads have to wait until the lock is released by the thread
that acquired it. This ensures only one thread can access a critical piece of
code/resource at a time.
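A minimal Python sketch of a mutex protecting a shared counter (the counter and
thread count are illustrative); without the lock, the read-modify-write in
counter += 1 can interleave between threads and lose updates:

    import threading

    counter = 0                      # shared resource
    lock = threading.Lock()          # the mutex

    def increment(times):
        global counter
        for _ in range(times):
            with lock:               # only one thread may hold the lock at a time
                counter += 1         # critical section: read-modify-write

    threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                   # always 400000; without the lock it can be less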

Semaphore - involves a counter that decides the maximum number of simultaneous
accesses to the shared resource (e.g. device connections). The counter is typically
initialized to that maximum: when a thread acquires the semaphore the counter is
decremented (and the thread blocks if the counter has reached zero), and when a
thread releases the resource the counter is incremented again.
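A minimal Python sketch using threading.Semaphore to cap simultaneous "device
connections" at an assumed maximum of 3 (the sleep stands in for real work):

    import threading
    import time

    max_connections = threading.Semaphore(3)   # at most 3 simultaneous holders

    def talk_to_device():
        with max_connections:    # acquire: counter goes down, blocks at zero
            time.sleep(0.1)      # simulate holding a device connection
        # released on exit: counter goes back up, waking one waiting thread

    threads = [threading.Thread(target=talk_to_device) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()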

Condition variable - a condition variable allows threads to wait for a specific
condition to become true. It's like waiting for a signal. A condition variable
always needs to be associated with a lock. It adds extra features to the lock (a
mutex lock): waiting, signalling, and getting signalled when the condition changes.
For example, a shared buffer with a producer and a consumer.

Example:

    import threading

    lock = threading.Lock()
    condition_variable = threading.Condition(lock)
    condition = False       # shared-state predicate the threads coordinate on

    # Waiting thread:
    with condition_variable:            # acquires the underlying lock
        while not condition:
            condition_variable.wait()   # releases the lock while waiting
        # Condition met, proceed with operation

    # Signalling thread:
    with condition_variable:
        condition = True
        condition_variable.notify()     # wake one waiting thread

----------------------------

more details:
=============
lock = threading.Lock():

This creates a basic mutual exclusion lock. Any thread that wants to enter a
critical section (access shared data) has to acquire the lock before proceeding. If
another thread already holds the lock, the thread has to wait until the lock is
released.

condition = threading.Condition(lock):

This creates a condition variable that is associated with the same lock (lock). A
condition variable allows threads to wait for a specific condition to be true (for
example, a buffer not being empty). While waiting, the thread releases the lock
temporarily (internally), allowing other threads to acquire the lock and potentially
change the state of the shared resource (e.g., adding an item to the buffer). When
the condition is met, another thread can call condition.notify() or
condition.notify_all() to wake up the waiting threads.

Key difference:
Unlike a plain lock, which just ensures mutual exclusion, a condition variable
provides a way to coordinate threads. The lock ensures safety for shared data,
while the condition allows threads to signal each other about state changes.
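Putting the pieces together, here is a runnable producer/consumer sketch of the
shared-buffer example above (item counts and names are illustrative):

    import threading
    from collections import deque

    buffer = deque()
    lock = threading.Lock()
    condition = threading.Condition(lock)

    def producer():
        for item in range(5):
            with condition:              # acquires the underlying lock
                buffer.append(item)      # change the shared state
                condition.notify()       # wake a waiting consumer

    def consumer():
        consumed = 0
        while consumed < 5:
            with condition:
                while not buffer:        # recheck: wakeups can be spurious
                    condition.wait()     # releases the lock while waiting
                item = buffer.popleft()
            consumed += 1
            print("consumed", item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()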

To get the maximum out of the compute power, developers need to use techniques such
as multi-process or multi-threaded applications. This improves application
performance. Shared-resource management in multi-threaded applications is done using
thread synchronization techniques. However, improper usage of these techniques can
lead to problems such as race conditions and deadlocks. Hence the developer has to
keep this in mind and use thread synchronization properly.

To make client-server communication feel like the server application is running on
the same machine as the client, RPCs were introduced.

RPC (Remote Procedure Call) is a protocol or communication model that allows a
program to execute a procedure (or function) on another system (usually on a remote
machine) as if it were a local procedure call. The key advantage of RPC is that it
abstracts the complexity of network communication, so developers can write code as
though the procedure being called is on the same system, even though it's actually
on a different machine.

Key Features of RPC:

Transparency:

RPC abstracts the details of the network communication from the developer. The
syntax and behavior of the function call appear identical to a local procedure
call. The underlying system handles the network communication (such as marshaling
the data, sending it over the network, and waiting for a response).

Client-Server Model:

RPC is inherently a client-server communication model:

Client: The system making the procedure call.
Server: The system executing the requested procedure.

The client sends a request to the server to execute a procedure with the necessary
parameters, and the server processes the request and returns the result back to the
client.

Marshaling and Unmarshaling:

Marshaling: Converting the function parameters and arguments into a format that can
be transmitted over a network (serializing the data).
Unmarshaling: Converting the received data back into the original format for
execution on the server.

This allows the function parameters to be transmitted between different systems,
even if they have different architectures or data formats.

Synchronous/Asynchronous Communication:

RPC can be synchronous or asynchronous (a short sketch follows this list):

Synchronous RPC: The client waits for the server to complete the procedure and
return a response before continuing its execution.
Asynchronous RPC: The client does not wait for a response from the server and
continues its execution. The server's response is handled separately.

Error Handling:

Since RPC involves communication between two systems over a network, it must handle
additional errors such as network failures, server unavailability, timeouts, and
data corruption.
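As promised above, a small Python sketch of the synchronous/asynchronous
distinction; call_remote is a hypothetical stub standing in for a real RPC:

    from concurrent.futures import ThreadPoolExecutor

    def call_remote(x):          # hypothetical stub standing in for a real RPC
        return x * 2             # imagine a network round trip happening here

    # Synchronous RPC: the caller blocks until the result comes back.
    result = call_remote(21)

    # Asynchronous RPC: the caller keeps going; the response is handled later.
    with ThreadPoolExecutor() as pool:
        future = pool.submit(call_remote, 21)
        # ... the client continues doing other work here ...
        result = future.result()     # collect the response when it is needed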

RPC Frameworks and Technologies:

gRPC (Google RPC):

A high-performance, open-source RPC framework developed by Google.
Supports multiple languages, and it's based on Protocol Buffers (protobuf) for data
serialization.
It provides features like load balancing, authentication, cancellation, and timeout
control.

XML-RPC:

An RPC protocol that uses XML to encode its calls and HTTP as the transport
protocol.
It is simple but lacks some of the advanced features found in modern RPC
frameworks.

JSON-RPC:

Similar to XML-RPC, but it uses JSON for encoding calls instead of XML.
Easier to work with compared to XML-RPC, due to JSON's simplicity and popularity.

CORBA (Common Object Request Broker Architecture):

A standard defined by the Object Management Group (OMG) that allows programs to
communicate with one another regardless of where they are located or who has
created them.
It is an older technology and less commonly used in modern applications.

Thrift:

An RPC framework originally developed by Facebook.
It supports multiple languages and includes a binary protocol for efficient
communication.
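As a concrete taste of RPC, here is a minimal sketch using Python's standard-library
XML-RPC support (the host, port, and the add procedure are assumptions for
illustration); run the two halves as separate processes:

    # --- server.py ---
    from xmlrpc.server import SimpleXMLRPCServer

    def add(a, b):                      # the procedure clients will call remotely
        return a + b

    server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
    server.register_function(add, "add")
    server.serve_forever()              # marshaling/unmarshaling is handled for us

    # --- client.py (run in a separate process) ---
    import xmlrpc.client

    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))              # looks local; actually runs on the server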
So far on the infra side we have discussed compute and communication. Now let's look
at storage.

The starting point of any distributed system is to harness the aggregate performance
improvements that can be achieved when work is done using 100s of computers working
together.

The same applies to data. When you have the power of multiple computers, you split
the data among them. This is called sharding.

So in the case of data (i.e. storage), performance -> sharding.
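A minimal sketch of one common sharding scheme, hash-mod placement (the server names
are hypothetical). Note that naive modulo placement remaps most keys when a shard is
added, which is why schemes like consistent hashing exist:

    import hashlib

    SHARDS = ["server-0", "server-1", "server-2", "server-3"]   # hypothetical

    def shard_for(key: str) -> str:
        # Hash the key and take it modulo the shard count, so each key
        # deterministically maps to one server.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return SHARDS[h % len(SHARDS)]

    print(shard_for("user:1001"))   # always routes to the same shard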

Now the challenges: 100s of servers handling data means faults can occur (any device
can go down, for example). Having human involvement every time a fault occurs is not
feasible, hence we need automatic fault tolerance. One of the ways to be tolerant to
failures is by keeping replicas. Replication can cause data inconsistencies; to have
consistency you need extra mechanisms, and those reduce performance.

Why is big data handling hard?

Intention: Harness the power of 100s of computers in processing the data and thereby
increase performance.

Performance ==> Done by sharding the data and sharing it among compute devices.

Sharding and working with 100s of servers ==> Faults

Faults ==> Fault tolerance mechanisms required.

Fault tolerance ==> Achieved via replication ( multiple copies of the same data )

Replication ==> Can cause data inconsistencies among copies

Inconsistencies ==> Require mechanisms to handle them, which reduce performance.

( This implies strong data consistency means low performance, and weak consistency
means better performance. People prefer weak consistency when possible. Strong
consistency means the behaviour is as consistent as if your data were residing on a
single machine, with one copy of the data, doing one thing at a time. Emulating that
behavior in a distributed system with 100s of servers is a tedious task and lowers
performance. )
-----------------------------------------------------------------------------------

Primary-backup replication
==========================

What kinds of failures can replication be expected to solve? It is not supposed to
solve all types of failures.

Compare with fail-stop failures of a single computer.

Fail-stop, in fault tolerance terms, means that if something goes wrong with the
computer, the computer stops executing. It doesn't compute incorrect results
( leading to incorrect state ). How to simulate? E.g. a power failure on a single
computer.

Fail-stop situations can be dealt with by replication; that is what this kind of
fault tolerance is expected to handle. Replication cannot solve faults due to
software/hardware bugs. Also, errors occurring to packets during transit, or disk
errors during storage, have to be dealt with by error-correcting codes such as CRC.

Replication is also expected to deal with failures that can take one data center
down, like natural disasters.

Do you need replicas? It depends on the cost and impact of failure.

Replication schemes:

1. State transfer: the primary sends a copy of its entire state, i.e. the contents
of its RAM, to the backup. This way both primary and backup remain in sync. Large
data transfers are required between primary and backup ( GB-sized, for example ). A
more efficient approach is to transfer only the parts that changed since the last
backup.

This approach of sending the minimum required changes with respect to the last
known good state ( min-diff calculation ) is utilized in NSO-to-device
communication. NSO-to-NSO replication is done via the HA package; need to check
whether it also uses a similar mechanism for efficient backup.

2. Replicated state machines: the idea is that most services you want to replicate
have internal operations that are deterministic except when an input comes in.
( That is, if two machines start with the same state and start executing from the
same point, they will produce the same outputs on both machines until the course of
execution is interrupted by an external event like an input. )

Replicated state machines don't send the entire state between the primary and the
backups. Instead, the primary only sends the external events, such as inputs
arriving from the outside world. The assumption is that if two or more machines are
in the same state and they all get the same input at the same time, they all
continue to be in the same state.
In a nutshell:

The state transfer approach to replication ==> transfers the entire contents of
memory ( or the changes since the last synchronization ) to the backups. ( Large
data transfers might be required. )
The replicated state machine approach ==> transfers the operations received from
clients/external systems from the primary to the backups. ( Small data transfers
compared to the state transfer approach. )
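To make the contrast concrete, a small sketch of both approaches for a hypothetical
key-value service ( names and values are illustrative ):

    # Hypothetical key-value service replicated two ways.
    state = {"x": 10, "y": 20}              # primary's full state

    # 1. State transfer: ship the state (or a diff of it) itself.
    snapshot_message = dict(state)          # potentially very large

    # 2. Replicated state machine: ship only the external operation; the
    #    backup applies the same deterministic operation to its own copy.
    operation_message = ("set", "x", 11)    # small, regardless of state size

    def apply(op, replica_state):
        kind, key, value = op
        if kind == "set":
            replica_state[key] = value

    backup_state = dict(state)              # assume backup starts in sync
    apply(operation_message, backup_state)  # backup stays in sync by replaying ops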

What if the primary and backup are not in the same state due to some failure?

Consider GFS, where the master controls who the primary chunk server is based on a
lease. When the lease expires, the primary at that point in time informs the master
that its lease has expired and it is no longer the primary. The master then grants
the lease to another chunk server, which becomes the new primary. Assume that each
chunk server is itself replicated for fault tolerance.

Now consider the scenario where the primary replica of the primary GFS chunk server
[ the primary GFS chunk server is decided by the GFS master, while the primary
within a chunk server's replica set is decided within that replica set ] sends a
notification to the GFS master that its lease has expired and it is no longer the
primary chunk server. After that it fails, and a secondary replica takes over as
the primary of that chunk server. The problem is that whatever external
communication just happened is not reflected in the secondary; when it takes
charge, it doesn't know about the lease expiry and thinks it is still the primary
for GFS. By this time the GFS master may have granted primary status to another
chunk server, leading to two GFS primaries. Replication fault tolerance should be
designed to handle these edge cases.

State transfer is good for multi-core machines and parallelism.

Deep dive into a replicated state machine
=========================================

What do you mean by state?

In this case you have to understand that the primary sees the input first, so it
must be slightly ahead. How far ahead? Or how close is the synchronization? If the
primary fails, there has to be a scheme for switching over ( cut-over ). Because
the primary is ahead of the secondary, a cut-over will inevitably create anomalies,
and the replication fault tolerance has to deal with this problem. You also need to
get the failed replica ( or a new replica ) back online as fast as possible ( in
the application world, Kubernetes ReplicaSet/Deployment controllers do this job ).

The replicated state machine approach is a cheap way of achieving synchronization.
But if you want to create a new replica, a state transfer has to be done to
bootstrap the process.

You also have to understand at what level the replication is happening. For
example, in GFS replication only happens for the chunk data. It doesn't copy the
state of the machine on which the chunk was served; that is, it doesn't care at
what instruction point an external input is handled on the primary versus the
backup.

But other systems, like VMware virtualization, do care about this deeper level when
considering the backup.

VMware Fault Tolerance ( deep backup - replicated state machine approach; even the
memory and registers have to be the same )
------------------------------------------------------------------------------------

Scenario: two VMs running some app. Both share a disk from a disk server cluster
available via the network.

When a client sends an input to the primary, the hypervisor gets it and sends a
copy of it to the backup VM as well. The primary then delivers it to the app in the
VM via the guest OS in that VM and subsequently returns the result to the client.
The backup goes through the same sequence of events, but its hypervisor knows that
it is the backup and drops the response to the client.

The messages exchanged between the primary VM and the backup VM are called LOG
ENTRIES, and they flow over the LOG CHANNEL: a logical connection between the
primary VM and the backup VM through which the messages/external events are
communicated to the backup.

Now come to the fault tolerance part of it.

What faults can occur? The primary can fail, or the backup can fail, but the
interesting case is when the primary fails. This means the backup is not getting
anything on the LOG CHANNEL.

The only thing the backup can infer when it sees no "keep alive" on the LOG CHANNEL
is that the primary is unavailable. It can't say that the primary has crashed; it
can only infer that the primary is unreachable. In this case, the backup takes over
the role of the primary. This happens by the hypervisor of the backup machine
letting the backup act freely, without having to wait for entries from the log
channel. The backup has to act as if it were the only server handling the client
requests. The same applies if the backup fails: the primary has to act in a similar
way.

Non-deterministic events that can happen in a replicated state machine ( RSMs are
deterministic, but, for example, an input request from a client comes when it
comes )
-----------------------------------------------------------------------

INPUT - the only mode of communication between client and server is the network.
Therefore, when you say an input arrived, you should think of it as packets
arriving.

INPUT ==> packet ==> implies the data in the packet.


Also, when a packet arrives, an interrupt has to announce that a packet arrived.
The interrupt has to be handled at exactly the same point in both the primary and
the secondary, otherwise their states will start diverging. This means the
interrupt has to occur at the same instruction point on both machines.

A mental model you can use for the primary and secondary RSMs is that both are busy
running some long sequence of assembly instructions, like LDA 5000, STA 6000, etc.
While the machines are busy with these operations, an interrupt occurs; they
process the interrupt and then continue. But which instruction the machines were
executing when the interrupt occurred can affect the next state. For both to stay
synchronized, they have to handle it at exactly the same instruction point.

In addition, there are a few instructions, like rand(), that behave differently on
different machines.

Multi-core parallelism can also affect an RSM. For now, the abstraction the
hypervisor provides to the guest OS is that of a single-core machine.

Now let's look at the LOG ENTRIES for the above.

A LOG ENTRY consists of:

    instruction number ( the number of instructions executed since the server
    started )
    type: for example, network input
    data ( for a packet arrival, it is the data in the packet; for a weird
    instruction like rand(), it is the execution result from the primary )
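A sketch of what such a log entry might look like as a record ( the field names are
hypothetical, not VMware's actual format ):

    from dataclasses import dataclass

    @dataclass
    class LogEntry:
        instruction_number: int   # instructions executed since the server started
        event_type: str           # e.g. "network_input" or "nondet_instruction"
        data: bytes               # packet contents, or the primary's rand() result

    entry = LogEntry(instruction_number=1_002_345,
                     event_type="network_input",
                     data=b"GET /index.html")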

How are the primary and backup time-synchronized?

The primary VM's physical server sends a clock interrupt every 100 seconds ( as an
example ) to the hypervisor. The hypervisor delivers that interrupt to the guest
OS, records the instruction number at which the interrupt was processed, and then
sends all these details through the LOG CHANNEL to the secondary. The secondary
only processes this timer, and ignores the timer interrupts from its own physical
machine.

How do we ensure that the backup does not get ahead of the primary by the time an
interrupt occurs?

The VMware hypervisor keeps a buffer for storing the interrupts coming in on the
log channel. The secondary executes only when there is at least one entry waiting
in that buffer, and only up to that entry's instruction point. This way the
secondary is always catching up with the primary and never getting ahead.

Imagine a packet arrives: if we directly let the guest OS read that data from the
host machine's buffer, there would be no way to let the secondary know at what
instruction point the interrupt occurred. Hence it has to go via the hypervisor
only.

Also, assume 99.99 percent of the traffic on the LOG CHANNEL is generated by the
incoming packet stream, not by nondeterministic instructions.

Fault tolerance edge cases ==> failure at an awkward time.

For example, if the primary responded to the client but then crashed, and the LOG
CHANNEL message was also not delivered to the secondary due to some network issue,
there is a replication failure leading to inconsistent data.

For example, suppose there is a value that needs to be incremented ( say from 10 to
11 ). The primary did that, and then the above situation happened. The next time a
client sends another increment request ( maybe even another client ), it receives
the same data as in the previous case, since the secondary never knew about the
first increment. It is absurd to expect the client to understand everything about
your replication and fault tolerance issues, so the replication itself has to deal
with this failure.

You have to implement a solution. VMware Fault Tolerance uses a technique called
the Output Rule: the primary sends the response to the client only after it gets an
acknowledgement from the secondary that the corresponding log entry was received.
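A toy sketch of the Output Rule ( queues stand in for the log channel and
acknowledgements; not VMware's actual implementation ):

    import queue
    import threading

    log_channel = queue.Queue()   # primary -> backup log entries
    acks = queue.Queue()          # backup -> primary acknowledgements

    def backup():
        entry = log_channel.get()      # receive and record the event
        acks.put(entry)                # acknowledge before the primary replies

    def primary_handle(request):
        response = request.upper()     # compute the result
        log_channel.put(request)       # ship the event to the backup first
        acks.get()                     # Output Rule: wait for the backup's ack
        return response                # only now release output to the client

    threading.Thread(target=backup).start()
    print(primary_handle("hello"))     # replies only after the backup has the event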
