Unit III
Parallel processing challenges – Flynn’s classification – SISD, MIMD, SIMD, SPMD, and Vector
Architectures - Hardware multithreading – Multi-core processors and other Shared Memory
Multiprocessors - Introduction to Graphics Processing Units, Clusters, Warehouse Scale Computers
and other Message-Passing Multiprocessors
INTRODUCTION
To fulfill increasing demands for higher performance, it is necessary to process data concurrently to achieve better
throughput instead of processing each instruction sequentially as in a conventional computer. Processing data
concurrently is known as parallel processing. There are two ways by which we can achieve parallelism. They are:
• Multiple Functional Units - The system may have two or more ALUs so that two or more instructions can be executed at the same time.
• Multiple Processors - The system may have two or more processors operating concurrently.
There are several different forms of parallel computing: bit-level, instruction-level, data-level and task-level
parallelism.
Multiprocessors
A computer system with two or more processors is called a multiprocessor system. The multiprocessor
software must be designed to work with a variable number of processors.
Features of Multiprocessor System:
o Better Performance
o Scalability
o Improved Availability / Reliability
o High Throughput
o Job-Level Parallelism/ Process-Level Parallelism
o Parallel Processing Program
Clusters
A set of computers connected over a local area network that function as a single large multiprocessor is called
a cluster.
Multicore Multiprocessors
A multicore is an architecture design that places multiple processors on a single die (computer chip) to enhance performance and allow multiple tasks to be processed simultaneously and more efficiently. Each of these processors is called a core.
Instruction Level Parallelism (ILP)
ILP is a measure of how many operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism, and it is exploited by overlapping the execution of instructions to improve performance. Pipelining, which runs programs faster by overlapping the execution of instructions, is an example of instruction-level parallelism.
Two methods of increasing ILP
o Increasing the depth of the pipeline
By increasing the depth of the pipeline, more instructions can be overlapped simultaneously, so the amount of parallelism being exploited is higher.
o Multiple Issue
Multiple issue is a technique that replicates the internal components of the computer so that it can launch multiple instructions in every pipeline stage. Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, equivalently, the CPI to be less than 1. For example, a two-issue processor can complete up to two instructions per clock cycle, for a best-case CPI of 0.5.
Types of Multiple Issue
There are two major ways to implement a multiple-issue processor:
• Static Multiple Issue – An approach to implementing a multiple-issue processor in which many decisions are made statically by the compiler before execution.
• Dynamic Multiple Issue – An approach to implementing a multiple-issue processor in which many decisions are made during execution by the processor.
The Concept of Speculation
Speculation is an approach that allows the compiler or the processor to ‘guess’ about the properties of an
instruction, so as to enable execution to begin for other instructions that may depend on the speculated instruction.
Types of Speculation
1. Compiler based Speculation
The compiler can use speculation to reorder instructions, moving an instruction across a branch or a load
across a store. In compiler-based speculation, exception problems are avoided by adding special speculation support
that allows such exceptions to be ignored until it is clear that they really should occur.
Recovery Mechanism
In the case of speculation in software, the compiler usually inserts additional instructions that check
the accuracy of the speculation and provide a fix-up routine to use when the speculation is incorrect.
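The following C fragment is a rough, source-level sketch (not from the text, and not actual compiler output) of what compiler-based speculation does: a load is moved above the branch that guards it, and an inserted check fixes up the result when the guess is wrong. The names a, i, n, original, and speculated are invented for the illustration.

/* Original: the load happens only on the path where the guard is true. */
int original(int *a, int i, int n) {
    int x = 0;
    if (i < n)           /* branch */
        x = a[i];        /* load executed only when the branch is taken */
    return x;
}

/* Speculated: the compiler guesses the branch will be taken and hoists the
 * load above it; the check below fixes up the result if the guess was wrong.
 * If the hoisted load could raise an exception (e.g., an out-of-bounds
 * access), the special speculation support mentioned above is needed so the
 * exception is ignored unless the load really should have executed. */
int speculated(int *a, int i, int n) {
    int x = a[i];        /* speculative load, started before the branch resolves */
    if (!(i < n))        /* compiler-inserted check */
        x = 0;           /* fix-up: discard the speculated value */
    return x;
}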
2. Hardware-based Speculation
The processor hardware can perform the same transformation, i.e., reordering instructions, at runtime. In hardware-based speculation, exceptions are simply buffered until it is clear that the instruction causing them is no longer speculative and is ready to complete; at that point the exception is raised, and normal exception handling proceeds.
[Figure: the scheduled code as it would look on a two-issue MIPS pipeline; the empty slots are nops.]
Loop Unrolling
An important compiler technique to get more performance from loops that access arrays is loop unrolling, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together. After unrolling, more ILP is available because instructions from different iterations can be overlapped.
Example
Loop: lw   $t0, 0($s1)         # $t0 = array element
      addu $t0, $t0, $s2       # add the scalar in $s2
      sw   $t0, 0($s1)         # store the result back
      addi $s1, $s1, -4        # decrement the pointer (4 bytes per word)
      bne  $s1, $zero, Loop    # branch if $s1 != 0
Let us see how well loop unrolling and scheduling work in the above example. For simplicity, assume that the loop index is a multiple of 4.
To schedule the loop without any delays, it turns out that we need to make four copies of the loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the loop will contain four copies each of lw, addu, and sw, plus one addi and one bne.
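As a rough C-level sketch (not from the text), the loop above adds the scalar in $s2 to each array element; the unrolled version below makes four copies of the loop body so that one copy of the loop overhead covers four element updates. The function and variable names are invented, and, as in the example, n is assumed to be a multiple of 4.

/* C equivalent of the MIPS loop: add scalar s to every element, walking
 * backwards through the array (names invented for illustration). */
void add_scalar(int *a, int n, int s) {
    for (int i = n - 1; i >= 0; i--)
        a[i] = a[i] + s;
}

/* Four-way unrolled version: the index update and branch execute once per
 * four element updates, and the four updates are independent, exposing more
 * ILP for a multiple-issue pipeline to overlap. Assumes n % 4 == 0. */
void add_scalar_unrolled(int *a, int n, int s) {
    for (int i = n - 1; i >= 3; i -= 4) {
        a[i]     = a[i]     + s;
        a[i - 1] = a[i - 1] + s;
        a[i - 2] = a[i - 2] + s;
        a[i - 3] = a[i - 3] + s;
    }
}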
During the unrolling process, the compiler introduced additional registers ($t1, $t2, $t3). The goal of this
process, called register renaming, is to eliminate dependences that are not true data dependences, but could either lead
to potential hazards or prevent the compiler from flexibly scheduling the code.
Register Renaming
It is the process of renaming the registers by the compiler or hardware to remove antidependences.
Consider how the unrolled code would look using only $t0. There would be repeated instances of lw $t0, 0($s1), addu $t0, $t0, $s2 followed by sw $t0, 4($s1), but these sequences, despite using $t0, are actually completely independent – no data values flow between one set of these instructions and the next set. This is what is called an antidependence or name dependence, which is an ordering forced purely by the reuse of a name, rather than a real data dependence, which is also called a true dependence.
Reorder Buffer
A reorder buffer is the buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.
PARALLEL PROCESSING CHALLENGES
Amdahl's law gives us a quick way to find the speedup from two factors: Fraction enhanced (Fe) and Speedup enhanced (Se). It is given as

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Speedup = Execution time before / Execution time after improvement

Therefore, Speedup = 1 / ((1 - Fe) + Fe / Se)
Fraction enhanced (Fe)
It is the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement.
Speedup enhanced (Se)
It tells how much faster the task would run if the enhancement mode was used for the entire program.
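The two-factor form of Amdahl's law is easy to check numerically. The small C helper below is only a sketch; the function name amdahl_speedup is invented for this illustration.

#include <stdio.h>

/* Amdahl's law: overall speedup given the enhanced fraction Fe and the
 * speedup Se of the enhanced part. */
double amdahl_speedup(double fe, double se) {
    return 1.0 / ((1.0 - fe) + fe / se);
}

int main(void) {
    /* Example: if 90% of the work is enhanced and that part runs 10x faster,
     * the overall speedup is only about 5.26x. */
    printf("Speedup = %.2f\n", amdahl_speedup(0.90, 10.0));
    return 0;
}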
Problems related to Amdahl’s Law:
1. Suppose you want to achieve a speed-up of 80 times faster with 100 processors. What percentage of the original computation can be sequential?
Solution:
Given: Speedup = 80, Speedup enhanced Se = 100, Fe = ?
Amdahl's law says that
Speedup = 1 / ((1 - Fe) + Fe / Se)
We can reformulate Amdahl's law in terms of speed-up versus the original execution time. This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by the improvement is the fraction Fe of the original execution time:
80 = 1 / ((1 - Fe) + Fe / 100)
80 × ((1 - Fe) + Fe / 100) = 1
80 - 80Fe + 0.8Fe = 1
79.2Fe = 79
Fe ≈ 0.9975
Thus, to achieve a speedup of 80 from 100 processors, the sequential percentage can only be about 0.25%.
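Rearranging Amdahl's law for Fe gives Fe = (1 - 1/Speedup) / (1 - 1/Se). The short check below (a sketch with invented variable names) confirms the numbers for this problem.

#include <stdio.h>

int main(void) {
    double speedup = 80.0, se = 100.0;
    /* From Speedup = 1 / ((1 - Fe) + Fe/Se), solve for Fe. */
    double fe = (1.0 - 1.0 / speedup) / (1.0 - 1.0 / se);
    printf("Fe = %.4f, sequential fraction = %.4f\n", fe, 1.0 - fe);
    /* Prints Fe = 0.9975 and a sequential fraction of 0.0025 (about 0.25%). */
    return 0;
}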
2. Speed-up Challenge: Bigger Problem (Increase in Problem Size)
Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10. For now let us assume only the matrix sum is parallelizable. What speed-up do you get with 10 versus 40 processors? Next, calculate the speed-ups assuming the matrices grow to 20 by 20.
Solution:
(i) Matrices of size 10 by 10
If we assume performance is a function of the time for an addition, t, then there are 10 additions that do not benefit from parallel processors and 100 additions that do. The time for a single processor is 110t.

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Execution time for 10 processors = 100t / 10 + 10t = 20t
Speedup with 10 processors = 110t / 20t = 5.5
Execution time for 40 processors = 100t / 40 + 10t = 12.5t
Speedup with 40 processors = 110t / 12.5t = 8.8

Potential speedup with 10 processors = (5.5 / 10) × 100 = 55%
Potential speedup with 40 processors = (8.8 / 40) × 100 = 22%

Thus, for this problem size, we get only 55% of the potential speed-up with 10 processors and 22% with 40.

(ii) Matrices grow to 20 by 20
There are still 10 scalar additions, but now there are 400 matrix additions that can be done in parallel. The time for a single processor is 410t.

Execution time for 10 processors = 400t / 10 + 10t = 50t
Speedup with 10 processors = 410t / 50t = 8.2
Execution time for 40 processors = 400t / 40 + 10t = 20t
Speedup with 40 processors = 410t / 20t = 20.5

Potential speedup with 10 processors = (8.2 / 10) × 100 = 82%
Potential speedup with 40 processors = (20.5 / 40) × 100 = 51%

Thus, for this larger problem size, we get 82% of the potential speed-up with 10 processors and 51% with 40.
Conclusion:
These examples show that getting good speed-up on a multiprocessor while keeping the problem size fixed is harder than getting good speed-up by increasing the size of the problem.
This allows us to introduce two terms that describe ways to scale up.
1. Strong Scaling – Speedup achieved on a multiprocessor without increasing the size of the problem.
2. Weak Scaling – Speedup achieved on a multiprocessor while increasing the size of the problem
proportionally to the increase in the number of processors.
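The numbers from the bigger-problem example can be reproduced with the simple timing model used above (10 serial additions plus the parallelizable matrix additions, each costing time t). The sketch below is illustrative only; the helper name sum_speedup is invented.

#include <stdio.h>

/* Speedup for the sum example: 'serial' additions cannot be parallelized,
 * while 'parallel' additions are spread over p processors. Times are in
 * units of t. */
double sum_speedup(double serial, double parallel, double p) {
    double before = serial + parallel;      /* single-processor time */
    double after  = serial + parallel / p;  /* execution time after improvement */
    return before / after;
}

int main(void) {
    /* Strong scaling: problem size fixed at 10x10 (100 matrix additions). */
    printf("10x10, 10 procs: %.1f\n", sum_speedup(10, 100, 10));  /* 5.5  */
    printf("10x10, 40 procs: %.1f\n", sum_speedup(10, 100, 40));  /* 8.8  */
    /* Larger problem, 20x20 (400 matrix additions), as in weak scaling. */
    printf("20x20, 10 procs: %.1f\n", sum_speedup(10, 400, 10));  /* 8.2  */
    printf("20x20, 40 procs: %.1f\n", sum_speedup(10, 400, 40));  /* 20.5 */
    return 0;
}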
3. Speedup Challenge: Balancing Load
FLYNN’S CLASSIFICATION
Parallel processing can be classified in many ways. It can be classified according to the internal organization of
processors, according to the interconnection structure used between processors or according to the flow of information
through the system.
One such classification was introduced by Michael J. Flynn. We know that a typical processing unit operates by
fetching instructions and operands from the main memory, executing the instructions, and placing the results in the
main memory. The steps associated with the processing of an instruction form an instruction cycle. The instruction
can be viewed as forming an instruction stream flowing from main memory to the processor, while the operands form
another stream, data stream, flowing to and from the processor.
[Figure: the instruction stream flows from memory (M) to the processor (P), while the data stream flows between the processor and memory.]
In 1966, Michael J. Flynn made an informal and widely used classification of processor parallelism based on the number of simultaneous instruction and data streams seen by the processor during program execution.
The classification made by Michael J. Flynn divides computers into four major groups:
• Single Instruction Stream – Single Data Stream (SISD)
• Single Instruction Stream – Multiple Data Stream (SIMD)
• Multiple Instruction Stream – Single Data Stream (MISD)
• Multiple Instruction Stream – Multiple Data Stream (MIMD)
Categorization based on No. of instruction streams & No. of Data streams
This classification is based on the number of instruction streams and the number of data streams. Thus, a conventional uniprocessor has a single instruction stream and a single data stream, and a conventional multiprocessor has multiple instruction streams and multiple data streams.
An application is data parallel if it wants to do the same computation on lots of pieces of data, which typically
come from different squares in a grid. Examples include image processing, weather forecasting, and computational
fluid dynamics (e.g. simulating airflow around a car or inside a jet engine).
Providing more than one arithmetic logic unit (ALU) that can all operate in parallel on different inputs,
providing the same operation, is an example of SIMD. This can be achieved by using multiple input buses in the CPU
for each ALU that load data from multiple registers. The processor's control unit sends the same command to each of
the ALUs to process the data and the results may be stored, again using multiple output buses. Machines that provide
vector operations are classified as SIMD. In this case a single instruction is simultaneously applied to a vector.
Advantages of SIMD
• It amortizes the cost of the control unit over dozens of execution units.
• It reduces the required instruction bandwidth and program memory.
• It needs only one copy of the code that is being executed simultaneously.
• SIMD works best when dealing with arrays in ‘for’ loops. Hence, for parallelism to work in SIMD, there must be a great deal of identically structured data, which is called data-level parallelism.
Disadvantages of SIMD
• SIMD is at its weakest in case or switch statements, where each execution unit must perform a different operation on its data, depending on what data it has.
• Execution units with the wrong data are disabled so that units with proper data may continue. In such situations, the SIMD processor essentially runs at 1/nth of peak performance, where n is the number of cases (see the sketch below).
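A minimal C sketch (not from the text; array and function names invented) contrasts the two situations: the first loop applies the same operation to every element, which is the data-level parallelism SIMD exploits well, while the second contains a data-dependent branch of the kind that forces some execution units to be disabled.

/* SIMD-friendly: identical work on every element of the array, so a SIMD
 * unit or vectorizing compiler can process several elements per instruction. */
void scale(const float *a, float *b, int n, float k) {
    for (int i = 0; i < n; i++)
        b[i] = k * a[i];
}

/* SIMD-unfriendly: the operation depends on the data, so on each iteration
 * the lanes holding the 'wrong' case must be disabled, reducing efficiency. */
void clamp_or_scale(const float *a, float *b, int n, float k) {
    for (int i = 0; i < n; i++) {
        if (a[i] < 0.0f)
            b[i] = 0.0f;        /* one case */
        else
            b[i] = k * a[i];    /* a different case */
    }
}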
[Remnant of a comparison table; only rows 5 and 6 survive: row 5 - both approaches easily capture the flexibility in data widths; row 6 - one is easier to evolve over time, while the other is complex to evolve over time.]
HARDWARE MULTITHREADING
Multithreading
Multithreading is a higher-level parallelism called thread-level parallelism (TLP) because it is logically
structured as separate threads of execution.
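As a rough illustration of thread-level parallelism (not from the text), the POSIX threads sketch below creates two independent threads of execution; the function name worker and the argument values are invented. Each thread could run on its own core of a multicore processor, or be interleaved on one core by hardware multithreading.

#include <pthread.h>
#include <stdio.h>

/* Each thread runs this function independently, with its own program
 * counter and stack: a separate thread of execution. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}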
When pipelining is used, it is essential to maximize the utilization of each pipeline stage to improve throughput. This can be accomplished by executing some instructions in a different order rather than executing them sequentially as they occur in the instruction stream, and by initiating the execution of some instructions even though it is not yet certain that they will be needed.
Applications
Multi-core processors are widely used across many application domains including
• General-purpose
• Embedded
• Network
• Digital Signal Processing (DSP)
• Graphics
Although the latency is uniform, it may be large for a network that connects many processors and memory modules.
For better performance, it is desirable to place a memory module close to each processor. The result is a collection of
nodes, each consisting of a processor and a memory module.
INTRODUCTION TO GRAPHICS PROCESSING UNITS
The increasing demands of processing for computer graphics have led to the development of specialized chips called graphics processing units (GPUs). The primary purpose of GPUs is to accelerate the large number of floating-point calculations needed in high-resolution three-dimensional graphics, such as in video games. Since the operations involved in these calculations are often independent, a large GPU chip contains hundreds of simple cores with floating-point ALUs to perform them in parallel.
• An example is the Compute Unified Device Architecture (CUDA) that NVIDIA Corporation uses for the
cores in its GPU chips.
Key characteristics: how GPUs differ from CPUs
GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU. This role allows them to dedicate all their resources to graphics. It is fine for GPUs to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the CPU can do them if needed. GPU problem sizes are typically hundreds of megabytes to gigabytes, but not hundreds of gigabytes to terabytes.
These differences led to different styles of architecture:
• The biggest difference is that GPUs do not rely on multilevel caches to overcome the long latency to memory, as
do CPUs. Instead, GPUs rely on hardware multithreading to hide the latency to memory.
• The GPU memory is oriented toward bandwidth rather than latency. There are even special graphics DRAM chips for GPUs that are wider and have higher bandwidth than the DRAM chips used for CPUs. In addition, GPUs have traditionally had smaller main memories than conventional microprocessors.
• Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many parallel
processors (MIMD) as well as many threads. Hence, each GPU processor is more highly multithreaded than a
typical CPU, plus they have more processors.
An Introduction to the NVIDIA GPU Architecture
The figure shows a simplified block diagram of a multithreaded SIMD processor. Dropping down one more level of detail, the machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions, which we also call a SIMD thread. It is a traditional thread, but it contains exclusively SIMD instructions. These SIMD threads have their own program counters, and they run on a multithreaded SIMD processor.
GPU Memory is shared by the vectorized loops. All threads of SIMD instructions within a thread block share Local
Memory. The Figure shows the memory structures of an NVIDIA GPU.
• The on-chip memory that is local to each multithreaded SIMD processor is called Local Memory. It is shared by the SIMD lanes within a multithreaded SIMD processor, but this memory is not shared between multithreaded SIMD processors.
• The off-chip DRAM shared by the whole GPU and all thread blocks is called GPU Memory. Rather than relying on large caches to contain the whole working set of an application, GPUs traditionally use smaller streaming caches and rely on extensive multithreading of threads of SIMD instructions to hide the long latency to DRAM, since their working sets can be hundreds of megabytes.
CLUSTERS
Clusters are collections of computers that are connected to each other using their I/O interconnect via standard
network switches and cables to form a message-passing multiprocessor. Each runs a distinct copy of the operating
system. Virtually every internet service relies on clusters of commodity servers and switches.
Drawbacks of cluster
o Administration cost – The cost of administering a cluster of n machines is about the same as the cost of
administering n independent machines, while the cost of administering a shared memory multiprocessor
with n processors is about the same as administering a single machine.
o Performance degradation – The processors in a cluster are usually connected using the I/O interconnect
of each computer; whereas the cores in a multiprocessor are usually connected on the memory
interconnect of the computer. The memory interconnect has higher bandwidth and lower latency,
allowing much better communication performance.
o Division of memory – A cluster of n machines has n independent memories and n copies of the operating
system, but a shared memory multiprocessor allows a single program to use almost all the memory in the
computer, and it only needs a single copy of the OS.
Advantages of Clusters
1. High availability – Since a cluster consists of independent computers connected through a local area network,
it is much easier to replace a machine without bringing down the system in cluster than in an SMP.
2. Scalable – Given that clusters are constructed from whole computers and independent, scalable networks, this
isolation also makes it easier to expand the system without bringing down the application that runs on top of
the cluster.
3. Low cost
4. Improve power efficiency – Clusters consume less power and work efficiently.
Examples
The search engines that millions of us use every day depend upon this technology. eBay, Google,
Microsoft, Yahoo, and others all have multiple datacenters each with clusters of tens of thousands of processors.
Message passing
Message passing is nothing but communication between multiple processors by explicitly sending and
receiving information.
Send Message Routine
A routine used by a processor in machines with private memories to pass a message to another processor.
Receive Message Routine
A routine used by a processor in machines with private memories to accept a message from another processor.
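A minimal MPI sketch (illustrative only; the tag value and variable names are invented) of the send and receive message routines described above: process 0 sends one integer to process 1 by explicit message passing. It is meant to be launched with two processes, e.g. via mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send message routine: pass one int to processor 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive message routine: accept the message from processor 0. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}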
Some concurrent applications run well on parallel hardware, independent of whether it offers shared addresses
or message passing. In particular, job-level parallelism and applications with little communication – like web search,
mail servers, and file servers – do not require shared addressing to run well.
Advantages
There were several attempts to build high-performance computers based on high-performance message-
passing networks, and they did offer better absolute communication performance than clusters built using local area
networks.
Disadvantages
The problem was that they were much more expensive. Few applications could justify the higher
communication performance, given the much higher costs.
WAREHOUSE-SCALE COMPUTERS
Warehouse-scale computers (WSCs) form the foundation of internet services that people use for search, social
networking, online maps, video sharing, online shopping, email, cloud computing, etc. The ever-increasing popularity
of internet services has necessitated the creation of WSCs in order to keep up with the growing demands of the public.