Module 1: Distributed System Models
In this chapter
• This chapter presents the evolutionary changes that have occurred in parallel, distributed,
and cloud computing over the past 30 years, driven by applications with variable workloads
and large data sets.
• We study both high-performance and high-throughput computing systems in parallel
computers appearing as computer clusters, service-oriented architectures, computational
grids, peer-to-peer networks, Internet clouds, and the Internet of Things.
• These systems are distinguished by their hardware architectures, OS platforms, processing
algorithms, communication protocols, and service models. We also introduce
essential issues of scalability, performance, availability, security, and energy efficiency
in distributed systems.
Scalable Computing over the Internet
Scalability is the ability of a system to handle a growing amount of work by adding resources.
For example, it can refer to the capability of a system to increase its total output under an
increased load when resources (typically hardware) are added.
The Age of Internet Computing
• Billions of people use the Internet every day. As a result, supercomputer sites and
large data centers must provide high-performance computing services to huge numbers of
Internet users concurrently. Because of this high demand, high-performance computing
(HPC) applications are no longer optimal for measuring system performance.
• The emergence of computing clouds instead demands high-throughput computing (HTC)
systems built with parallel and distributed computing technologies.
The Platform Evolution
Computer technology has gone through five generations of development, with each
generation lasting from 10 to 20 years. Successive generations overlapped by
about 10 years.
• The high-technology community has argued for many years about the precise
definitions of centralized computing, parallel computing, distributed computing,
and cloud computing. In general, distributed computing is the opposite of
centralized computing.
• The field of parallel computing overlaps with distributed computing to a great
extent, and cloud computing overlaps with distributed, centralized, and
parallel computing
Centralized computing
This is a computing paradigm by which all computer resources are centralized
in one physical system. All resources (processors, memory, and storage) are fully
shared and tightly coupled within one integrated OS. Many data centers and
supercomputers are centralized systems, but they are used in parallel, distributed,
and cloud computing applications.
• One example of a centralized computing system is a traditional mainframe
system, where a central mainframe computer handles all processing and data
storage for the system. In this type of system, users access the mainframe
through terminals or other devices that are connected to it.
Parallel Computing
• Parallel computing is the simultaneous use of multiple processing elements to solve a
problem. A problem is broken down into parts that are executed concurrently, with every
assigned resource working at the same time.
• In parallel computing, all processors are either tightly coupled with centralized shared
memory or loosely coupled with distributed memory. Interprocessor communication is
accomplished through shared memory or via message passing.
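As a rough illustration of the shared-memory style of interprocessor communication, here is a minimal Python sketch (not from the text; the workload and names are illustrative) in which several processes update a single counter placed in shared memory:

    # Several worker processes increment a shared counter (tightly coupled style).
    from multiprocessing import Process, Value, Lock

    def worker(counter, lock, n):
        for _ in range(n):
            with lock:                 # synchronize access to the shared memory
                counter.value += 1

    if __name__ == "__main__":
        counter = Value("i", 0)        # an integer living in shared memory
        lock = Lock()
        procs = [Process(target=worker, args=(counter, lock, 10_000)) for _ in range(4)]
        for p in procs: p.start()
        for p in procs: p.join()
        print("final count:", counter.value)   # expected: 40000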
Distributed Computing
• Distributed computing is the method of making multiple computers work together
to solve a common problem. It makes a computer network appear as a powerful
single computer that provides large-scale resources to deal with complex
challenges.
• A distributed system consists of multiple autonomous computers, each having its own
private memory, communicating through a computer network. Information exchange in a
distributed system is accomplished through message passing.
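By contrast, message passing can be sketched as follows (again a minimal, illustrative Python example): two processes with private memory exchange data only by sending messages over a pipe.

    # Message passing between processes with private memory (no shared state).
    from multiprocessing import Process, Pipe

    def worker(conn):
        task = conn.recv()             # receive a message: a list of numbers
        conn.send(sum(task))           # reply with a partial result
        conn.close()

    if __name__ == "__main__":
        parent_conn, child_conn = Pipe()
        p = Process(target=worker, args=(child_conn,))
        p.start()
        parent_conn.send(list(range(100)))              # request: sum these numbers
        print("partial result:", parent_conn.recv())    # prints 4950
        p.join()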
Cloud Computing
• An Internet cloud of resources can be either a centralized or a distributed computing system.
The cloud applies parallel or distributed computing, or both, and can be built with physical or
virtualized resources over large data centers.
With the concept of scalable computing under our belt, it’s time to explore hardware,
software, and network technologies for distributed computing system design and
applications. We will focus on viable approaches to building distributed operating
systems for handling massive parallelism in a distributed environment.
Multicore CPUs and Multithreading Technologies
Advances in CPU Processors - Today, advanced CPUs or microprocessor chips assume a
multicore architecture with dual, quad, six, or more processing cores.
• We see growth from 1 MIPS for the VAX 780 in 1978 to 1,800 MIPS for the Intel
Pentium 4 in 2002, up to a 22,000 MIPS peak for the Sun Niagara 2 in 2008.
• The clock rate for these processors increased from 10 MHz for the Intel 286 to 4
GHz for the Pentium 4 in 30 years.
• However, the clock rate has reached its limit on CMOS-based chips due to power
limitations. At the time of this writing, very few CPU chips run with a clock rate
exceeding 5 GHz.
Figure: Improvement in processor and network technologies over 33 years
Modern Multicore CPU Chip
• A multicore processor is an integrated circuit that has two or more processor cores
attached for enhanced performance and reduced power consumption. These
processors also enable more efficient simultaneous processing of multiple tasks,
such as with parallel processing and multithreading.
Hierarchy of Caches
Caches are relatively small areas of very fast
memory. A cache retains often-used instructions or
data, making that content readily available to the
core without the need to access system memory. A
processor checks the cache first. If the required
content is present, the core takes that content from
the cache, enhancing performance benefits. If the
content is absent, the core will access system
memory for the required content. A Level 1, or L1,
cache is the smallest and fastest cache unique to
every core. A Level 2, or L2, cache is a larger
storage space shared among the cores.
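The lookup order described above can be sketched as a toy Python function, with plain dictionaries standing in for the L1 cache, the L2 cache, and system memory (all names here are illustrative, not from the text):

    # Walk the cache hierarchy before falling back to system memory.
    def load(address, l1, l2, memory):
        if address in l1:            # fastest, private to the core
            return l1[address]
        if address in l2:            # larger, shared among the cores
            value = l2[address]
            l1[address] = value      # promote into L1 for future accesses
            return value
        value = memory[address]      # slowest: main (system) memory
        l2[address] = value
        l1[address] = value
        return value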
Important Terminologies
Clock Speed - Clock speed refers to the rate at which a computer's central
processing unit (CPU) executes instructions. It's often measured in hertz (Hz) and
indicates how many cycles the CPU can complete per second. A higher clock speed
generally means faster processing.
Hyper-threading - Another approach involved the handling of multiple instruction
threads. Intel calls this hyper-threading. With hyper-threading, processor cores are
designed to handle two separate instruction threads at the same time
ILP, TLP and DSP
• Instruction-Level Parallelism (ILP) – the technique of executing multiple instructions
simultaneously within a CPU core by keeping different functional units busy with different
parts of instructions. It enhances performance without requiring changes to the program
code, allowing the overlapping execution of multiple instructions.
• Thread-Level Parallelism (TLP) – the ability of a computer system to execute multiple
threads simultaneously, improving the overall efficiency and performance of applications.
TLP is a form of parallel computing in which different threads of a program run
concurrently, often on multiple processors or cores.
• Digital Signal Processor (DSP) architecture – a DSP achieves high processing
efficiency by executing four functions concurrently in every processor cycle: instruction
prefetching from a dedicated instruction memory and generation of an effective operand
address, access to a single-port data memory and transfer of a data word over a common
data bus, an arithmetic/logic-unit (ALU) operation, and a multiplication.
CPU and GPU
A Central processing unit (CPU) is commonly known as the brain of the computer. It is a
conventional or general processor used for a wide range of operations encompassing the
system instructions to the programs. CPUs are designed for high-performance serial processing
which implies they are well-suited for performing large amounts of sequential tasks.
The graphics processing unit (GPU) is designed for parallel processing, and it uses
dedicated memory known as VRAM (video RAM). GPUs are designed to tackle thousands of
operations at once for tasks like rendering images, 3D rendering, processing video, and
running machine learning models. A GPU has its own memory, separate from the system’s RAM,
which allows it to handle complex, high-throughput tasks like rendering and AI processing
efficiently.
Figure: CPU and GPU Architecture
Need for GPUs
• Both multi-core CPU and many-core GPU processors can handle multiple
instruction threads at different magnitudes today.
• Multicore CPUs may increase from tens of cores to hundreds or more in the
future. But the CPU has reached its limit in exploiting massive data-level parallelism
(DLP) due to the memory wall problem (discussed later).
• This has triggered the development of many-core GPUs with hundreds or more thin
cores.
Multithreading Technology
• Multithreading is a form of parallelization or dividing up work for simultaneous
processing. Instead of giving a large workload to a single core, threaded programs
split the work into multiple software threads. These threads are processed in
parallel by different CPU cores to save time.
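A minimal Python sketch of this pattern (illustrative only; note that CPython's GIL means a pure-Python compute loop will not truly run on multiple cores at once, but the threading structure is the same):

    # Split one workload into four software threads and combine the results.
    import threading

    def work(chunk, results, index):
        results[index] = sum(x * x for x in chunk)   # process one slice of the data

    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]          # divide the work four ways
    results = [0] * 4
    threads = [threading.Thread(target=work, args=(c, results, i))
               for i, c in enumerate(chunks)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("total:", sum(results))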
Figure: Five microarchitectures in modern CPU processors
Explanation
• The superscalar processor is single-threaded with four functional units. Each of the three
multithreaded processors is four-way multithreaded over four functional data paths. In
the dual-core processor, assume two processing cores, each a single-threaded two-way
superscalar processor.
• In the figure, instructions from the five independent threads are distinguished by
different shading patterns.
• Fine-grain multithreading switches the execution of instructions from different threads
per cycle.
• Coarse-grain multithreading executes many instructions from the same thread for quite
a few cycles before switching to another thread.
• The multicore CMP executes instructions from different threads completely, on separate cores.
• The SMT allows simultaneous scheduling of instructions from different threads in the
same cycle.
• The blank squares correspond to no available instructions for an instruction data path at
a particular processor cycle. More blank cells imply lower scheduling efficiency.
GPU Computing
• A GPU is a graphics coprocessor or accelerator mounted on a computer’s graphics card or
video card.
• A GPU offloads the CPU from tedious graphics tasks in video editing applications.
• The world’s first GPU, the GeForce 256, was marketed by NVIDIA in 1999. These GPU
chips can process a minimum of 10 million polygons per second, and are used in
nearly every computer on the market today.
• Unlike CPUs, GPUs have a throughput architecture that exploits massive parallelism
by executing many concurrent threads slowly, instead of executing a single long
thread in a conventional microprocessor very quickly
Working of GPU
• Modern GPUs are not restricted to accelerated graphics or video coding. They are used in
HPC systems to power supercomputers with massive parallelism at multicore and
multithreading levels. GPUs are designed to handle large numbers of floating-point
operations in parallel.
• In a way, the GPU offloads the CPU from all data-intensive calculations, not just those that
are related to video processing. Conventional GPUs are widely used in mobile phones,
game consoles, embedded systems, PCs, and servers. The NVIDIA CUDA Tesla or Fermi is
used in GPU clusters or in HPC systems for parallel processing of massive floating-point
data.
• The GPU has a many-core architecture that has hundreds of simple processing cores
organized as multiprocessors. Each core can have one or more threads.
• The CPU instructs the GPU to perform massive data processing. The bandwidth must be
matched between the on-board main memory and the on-chip GPU memory. This process
is carried out in NVIDIA’s CUDA programming using the GeForce 8800 or Tesla and Fermi
GPUs.
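As a rough sketch of this CPU-orchestrates / GPU-executes pattern in Python (an illustration using the CuPy library, which is an assumption on our part; the text's own examples use NVIDIA's CUDA toolkit directly):

    # Offload an elementwise array computation to the GPU and copy the result back.
    import numpy as np
    import cupy as cp          # requires a CUDA-capable GPU and the cupy package

    x_cpu = np.random.rand(1_000_000).astype(np.float32)

    x_gpu = cp.asarray(x_cpu)            # copy data from host RAM to on-board GPU memory
    y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0   # many GPU threads execute the elementwise ops in parallel
    y_cpu = cp.asnumpy(y_gpu)            # copy the result back to host RAM

    print(y_cpu[:5])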
Figure: The use of a GPU along with a CPU for massively parallel execution
Example 1.1: The NVIDIA Fermi GPU Chip with 512 CUDA Cores
Present Day: The NVIDIA A100 Tensor Core GPU
Power Efficiency of the GPU
• Bill Dally of Stanford University considers power and massive parallelism as the
major benefits of GPUs over CPUs for the future.
• By extrapolating current technology and computer architecture, it was estimated
that 60 Gflops/watt per core is needed to run an exaflops system.
• FLOPS, or floating-point operations per second, is a measure of performance,
meaning how fast the computer can perform calculations. A GFLOPS is simply a billion
(giga) FLOPS, so a GPU with a GFLOPS rating twice as high is very likely to speed up the
training process.
• Today's massively parallel supercomputers are measured in teraflops (Tflops: 10^12
flops).
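As a rough worked implication of the 60 Gflops/watt figure above (treating it as a system-wide efficiency target, which is our reading): an exaflops machine performing 10^18 flops would draw on the order of 10^18 / (60 × 10^9) ≈ 1.7 × 10^7 watts, i.e., roughly 17 MW.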
Figure: GPU Performance
Memory, Storage, and Wide-Area Networking
• Memory Wall Problem - The memory wall refers to the increasing gap between
processor speed and memory bandwidth, where the rate of improvement in
processor performance outpaces the rate of improvement in memory performance
due to limited I/O and decreasing signal integrity.
Figure: Memory Technology
Memory and Storage
• The rapid growth of flash memory and solid-state drives (SSDs) also impacts the
future of HPC and HTC systems. The failure (mortality) rate of SSDs is quite low.
• For hard drives, capacity increased from 260 MB in 1981 to 250 GB in 2004.
• A typical SSD can handle 300,000 to 1 million write cycles per block.
• Eventually, power consumption, cooling, and packaging will limit large system
development. Power increases linearly with respect to clock frequency and
quadratically with respect to the voltage applied to chips.
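This follows from the usual dynamic-power approximation for CMOS chips, P_dynamic ≈ α · C · V² · f, where C is the switched capacitance, V the supply voltage, f the clock frequency, and α the activity factor; hence the linear dependence on frequency and the quadratic dependence on voltage.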
Seagate – Present Day
System-Area Interconnects
• The nodes in small clusters are mostly interconnected by an Ethernet switch or a
local area network (LAN).
• LAN typically is used to connect client hosts to big servers.
• A storage area network (SAN) connects servers to network storage such as disk
arrays. Network attached storage (NAS) connects client hosts directly to the disk
arrays.
• All three types of networks often appear in a large cluster built with commercial
network components. If no large distributed storage is shared, a small cluster
can be built with a multiport Gigabit Ethernet switch plus copper cables to link
the end machines.
Virtual Machines and Virtualization Middleware
Example – Salesforce (SaaS)
Remote Method Invocation (RMI) is a mechanism that allows an object residing in one system (JVM) to
access or invoke an object running on another JVM. RMI is used to build distributed applications; it provides
remote communication between Java programs. It is provided in the package java.rmi.
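RMI itself is Java-specific; purely as an illustration of the same remote-invocation idea, here is a minimal Python sketch using the standard-library xmlrpc modules (the host, port, and method names are made up):

    # server.py - exposes a method that remote callers can invoke.
    from xmlrpc.server import SimpleXMLRPCServer

    def add(a, b):
        return a + b

    server = SimpleXMLRPCServer(("localhost", 8000))
    server.register_function(add, "add")
    server.serve_forever()

    # client.py - run separately; it invokes the remote method as if it were local:
    #   import xmlrpc.client
    #   proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    #   print(proxy.add(2, 3))   # executes on the server, returns 5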
IIOP (Internet Inter-ORB Protocol) is a protocol that makes it feasible for distributed
applications written in various programming languages to interact over the Internet. IIOP is a
vital part of a major industry standard, the Common Object Request Broker Architecture
(CORBA).
Evolution of SOA – Service-Oriented Architecture
Amdahl’s Law
Speedup can be expressed as Speedup = Pe/Pw (the ratio of performance with the enhancement
to performance without it) or, equivalently, Speedup = Ew/Ee (the ratio of execution time
without the enhancement to execution time with it).
The formula for Amdahl’s law is:
S = 1 / (1 – P + (P / N))
Where:
S is the speedup of the system
P is the proportion of the execution that can be improved (e.g., parallelized)
N is the number of processors in the system
For example, if a system has a single bottleneck that occupies 20% of the total execution time (so
the remaining 80% can be improved, P = 0.8), and we add 4 more processors to the system (N = 5),
the speedup would be S = 1 / (0.2 + 0.8/5) ≈ 2.78.
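The same calculation as a small Python sketch (treating the 20% bottleneck as the serial fraction is our reading of the example):

    # Amdahl's law: speedup with parallel fraction p on n processors.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    print(amdahl_speedup(0.8, 5))       # ~2.78 for the example above
    print(amdahl_speedup(0.8, 1000))    # ~4.98: bounded by the serial 20%, no matter how many processors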
HA (high availability) is desired in all clusters, grids, P2P networks, and cloud
systems. A system is highly available if it has a long mean time to failure (MTTF) and
a short mean time to repair (MTTR). System availability is formally defined as follows:
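System Availability = MTTF / (MTTF + MTTR)

For example, a system with an MTTF of 1,000 hours and an MTTR of 10 hours has an availability of 1,000 / 1,010 ≈ 99%.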