Unit VI Parallel Programming Concepts
It can take advantage of non-local resources when the local
resources are finite.
Serial Computing 'wastes' the potential computing power; Parallel Computing makes better use of the underlying hardware.
Types of Parallelism
Parallelism in Hardware (Uniprocessor)
▪ Parallelism in a Uniprocessor
– Pipelining
– Superscalar, VLIW etc.
▪ SIMD instructions, Vector processors, GPUs
▪ Multiprocessor
Example
for (i=1; i<=100; i= i+1)
y[i] = y[i] + x[i];
Thread A, running on core 0, could sum the elements [0] . . . [N/2 − 1], while thread B, running on core 1, could sum the elements [N/2] . . . [N − 1]. The two threads would then run in parallel on separate computing cores, as sketched below.
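A minimal sketch of this split, assuming POSIX threads and an illustrative array of N integers (the helper name sum_half and the partial[] array are hypothetical, not part of the example above):

/* Sketch: summing an array with two threads, one per half. */
#include <pthread.h>
#include <stdio.h>

#define N 1000
static int data[N];
static long partial[2];                    /* one partial sum per thread */

static void *sum_half(void *arg) {
    long id = (long)arg;                   /* 0 = thread A, 1 = thread B */
    int start = id * (N / 2);
    long s = 0;
    for (int i = start; i < start + N / 2; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1;          /* sample values */
    pthread_t a, b;
    pthread_create(&a, NULL, sum_half, (void *)0L);   /* thread A: elements [0 .. N/2-1] */
    pthread_create(&b, NULL, sum_half, (void *)1L);   /* thread B: elements [N/2 .. N-1] */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("total = %ld\n", partial[0] + partial[1]);
    return 0;
}

Each thread touches only its own half of the array, so no locking is needed; the main thread joins both threads and adds the two partial sums.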
Bit-level parallelism
Bit-level parallelism is a form of parallel computing based on
increasing processor word size, depending on very-large-scale
integration (VLSI) technology.
Enhancements in computer design were achieved by increasing the processor word size, from 4-bit to 8-bit, 16-bit, 32-bit, and eventually 64-bit processors.
Developing parallel hardware and software has traditionally been time and effort
intensive.
If one is to view this in the context of rapidly improving uniprocessor speeds, one
is tempted to question the need for parallel computing.
The Computational Speed Argument: For some applications, this is the only means
of achieving needed performance.
The Memory/Disk Speed Argument: For some other applications, the needed I/O
throughput can be provided only by a collection of nodes.
The Data Communication Argument: In yet other applications, the distributed
nature of data implies that it is unreasonable to collect data to process it at a single
location.
In short, the motivations for parallel computing are:
Scientific (research)
Parallel Programming Platforms
The traditional logical view of a sequential computer consists of a
memory connected to a processor via a datapath. All three components
– processor, memory, and datapath – present bottlenecks to the
overall processing rate of a computer system
The main objective is to provide sufficient details to the programmer to be able to write efficient programs on a variety of parallel platforms.
Pipelining
Pipelining is the process of fetching the next instruction while the current instruction is being executed by the processor.
These operations are put into a very long instruction word
which the processor can then take apart without further analysis,
handing each operation to an appropriate functional unit.
VLIW Processor
VLIW is sometimes viewed as the next step beyond the reduced instruction set computing (RISC) architecture, which also works with a limited set of relatively basic instructions and can usually execute more than one instruction at a time (a characteristic referred to as superscalar).
VLIW Architecture
VLIW Processor
Advantages of VLIW architecture
Increased performance.
Potentially scalable i.e. more execution units can be added and so more instructions
can be packed into the VLIW instruction.
• An explicitly parallel program must specify the concurrency of tasks and the interactions between them. The former is sometimes also referred to as the control structure and the latter as the communication model.
Control Structure of Parallel Platforms
Parallel tasks can be specified at various levels of granularity. At one extreme, each program in a set of programs can be viewed as one parallel task; at the other extreme, individual instructions within a program can be viewed as parallel tasks. Between these extremes lie a range of models for specifying the control structure of programs and the corresponding architectural support for them.
Parallelism from single instruction on multiple processors
Consider the following code segment that adds two vectors:
for (i = 0; i < 1000; i++)
    c[i] = a[i] + b[i];
In this example, the various iterations of the loop are independent of each other; i.e., c[0] = a[0] + b[0];, c[1] = a[1] + b[1];, etc., can all be executed independently of each other. Consequently, if there is a mechanism for executing the same instruction (in this case, add) on all the processors with appropriate data, we could execute this loop much faster.
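One way to realize this single-instruction, multiple-data pattern on current hardware is to let the compiler vectorize the loop. The sketch below uses the OpenMP simd directive purely as an illustration; OpenMP and the function name are assumptions, not something the text above prescribes:

/* Illustrative sketch: the same element-wise add, annotated so an
 * OpenMP-capable compiler may apply one SIMD add to several elements
 * per instruction. */
#include <stddef.h>

void vector_add(float *c, const float *a, const float *b, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];    /* independent iterations: same operation, different data */
}

Because every iteration is independent, the same loop can also be distributed across processors instead of (or in addition to) SIMD lanes.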
Definitions
Computation / Communication Ratio:
In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
– Periods of computation are typically separated from periods of communication by
synchronization events.
Fine grain parallelism
Coarse grain parallelism
Fine-grain Parallelism
• Relatively small amounts of computational work
are done between communication events
• Low computation to communication ratio
• Facilitates load balancing
• Implies high communication overhead and less
opportunity for performance enhancement
• If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer than
the computation.
Coarse-grain Parallelism
• Relatively large amounts of computational work are done between communication/synchronization events
• High computation to communication ratio
• Implies more opportunity for performance increase
• Harder to load balance efficiently
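As a rough sketch of how this ratio shows up in code (compute_element, compute_block, and exchange below are illustrative stand-ins, not functions defined anywhere above):

/* Contrast of the two granularities; the helpers are dummy stand-ins
 * for real computation and for a communication/synchronization event. */
#define N     10000
#define BLOCK 1000

static double work;
static void compute_element(int i) { work += i; }
static void compute_block(int b)   { for (int i = 0; i < BLOCK; i++) work += b * BLOCK + i; }
static void exchange(void)         { /* stand-in for a message or synchronization */ }

void fine_grain(void) {
    /* communicate after every element: low computation-to-communication ratio */
    for (int i = 0; i < N; i++) { compute_element(i); exchange(); }
}

void coarse_grain(void) {
    /* communicate once per large block: high computation-to-communication ratio */
    for (int b = 0; b < N / BLOCK; b++) { compute_block(b); exchange(); }
}

int main(void) { fine_grain(); coarse_grain(); return 0; }

Communication cost pushes a design toward coarse grain, while load balancing pushes it toward fine grain; the best granularity depends on the algorithm and the hardware.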
A typical SIMD architecture (a) and a typical MIMD architecture (b).
Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement;
(b) the execution of the statement in two steps
Communication Model of Parallel Platforms
Shared-Address-Space Platforms
Typical shared-address-space architectures:
(a) Uniform-memory-access (UMA) shared-address-space computer.
In this model, all the processors share the physical memory uniformly. All the processors have equal access time to all the memory words. Each processor may have a private cache memory. The same rule is followed for peripheral devices.
When all the processors have equal access to all the peripheral devices, the system is called a symmetric multiprocessor.
When only one or a few processors can access the peripheral devices, the system is called an asymmetric multiprocessor.
(b) Uniform-memory-access (UMA) shared-address-space computer with caches and memories.
(c) Non-uniform-memory-access (NUMA) shared-address-space computer with local memory only.
Cache-Only Memory Access (COMA): the COMA model is a special case of the NUMA model. Here, all the distributed main memories are converted to cache memories.
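In a shared-address-space platform, processes or threads interact simply by reading and writing common memory locations. A small sketch with POSIX threads (the shared counter is purely illustrative, not an example from the slides):

/* Sketch: two threads communicating through the shared address space;
 * a mutex serializes their updates to the shared counter. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                            /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                                  /* interaction happens through shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);             /* 200000 */
    return 0;
}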
Physical Organization of Parallel Platforms
Parallel Random Access Machines (PRAM) is a model, which is considered for most of the
parallel algorithms. Here, multiple processors are attached to a single block of memory.
All the processors share a common memory unit. Processors can communicate among
themselves through the shared memory only.
A memory access unit (MAU) connects the processors with the single shared memory.
PRAM models:
Concurrent-read, exclusive-write (CREW) PRAM. In this class, multiple read accesses to a memory location are allowed, but multiple write accesses are not allowed (e.g., websites, blogs).
Exclusive-read, concurrent-write (ERCW) PRAM. Multiple write accesses are allowed to a memory location, but multiple read accesses are serialized (e.g., a developer working with a DBA).
Concurrent-read, concurrent-write (CRCW) PRAM. This class allows multiple read and write accesses to a common memory location. This is the most powerful PRAM model (e.g., cloud services).
There are many methods to implement the PRAM model; the most prominent are:
1. Shared Memory Model
2. Message Passing Model (distributed memory)

Message-Passing Platforms
In its most general form, message-passing paradigms support execution of a different program on each of the p nodes.
Distributed Memory
Processors have their own local memory. Memory addresses in one processor do not map
to another processor, so there is no concept of global address space across all processors.
Distributed memory systems require a communication network to connect inter-processor
memory.
Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can can be as simple
as Ethernet.
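A minimal message-passing sketch for such a distributed-memory system, using MPI as one example library (MPI is an assumption here; the text above does not name a particular library):

/* Sketch: rank 0 and rank 1 have private memories, so data moves only
 * through explicit send/receive calls written by the programmer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;                                       /* data in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;                                            /* rank 1 cannot read rank 0's memory directly */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Here the programmer decides exactly how and when the value is communicated, which is the responsibility described above.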
Interconnection Networks for Parallel Computers
Interconnection networks can be classified as static or dynamic.
Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks.
Dynamic networks are built using switches and communication links; they are also referred to as indirect networks.
Figure: Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Network Topology
Static networks include the linear array, ring, tree, star, mesh, hypercube, etc.
Dynamic networks include buses, crossbar switches, mesh networks, multistage networks, etc.
Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
• Tree-Based Networks : In this topology one path is used between any pair of
nodes.
• Static and dynamic tree
• Static tree: each node of the tree is a processing element
• Dynamic tree: intermediate nodes are switching nodes
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
A mesh is a network topology in which processing elements are arranged in a grid.
The row and column positions are used to denote a particular processor in the mesh network.
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D
mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
Construction of hypercubes from hypercubes of lower dimension.
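A useful property behind this construction: in a d-dimensional hypercube, two nodes are directly connected exactly when their binary labels differ in one bit. A short illustrative sketch (not from the slides) that prints a node's neighbors:

/* Sketch: the neighbors of a node in a d-dimensional hypercube are
 * obtained by flipping each of its d label bits in turn. */
#include <stdio.h>

int main(void) {
    int d = 3;                               /* 3-D hypercube: 8 nodes, labels 0..7 */
    int node = 5;                            /* binary 101 */
    for (int bit = 0; bit < d; bit++) {
        int neighbor = node ^ (1 << bit);    /* flip one bit -> one direct link */
        printf("node %d <-> node %d\n", node, neighbor);
    }
    return 0;
}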
N-wide Superscalar Architecture
Base Scalar Processor:
• It is defined as a machine with one instruction issued per cycle.
What does Superscalar mean?
• Common instructions (arithmetic, load/store, conditional branch) can be
initiated and executed independently in separate pipelines
—Instructions are not necessarily executed in the order in which
they appear in a program
—Processor attempts to find instructions that can be executed
independently, even if they are out-of-order
—Use additional registers and register renaming to eliminate some
dependencies
• Equally applicable to RISC & CISC
• Quickly adopted and now standard approach for high-performance microprocessors
A 5-stage Pipeline
Figure: instruction fetch (IF) and instruction decode (ID) stages, memory, general registers, and multiple functional units.
▪ Superscalar: several instructions are fetched simultaneously and proceed through the same stages of their execution in parallel.
Multi-core Processors
Introduction: What is a Processor?
• 3D Gaming
• Database servers
• Multimedia applications
• Video editing
• Powerful graphics solution
• Encoding
• Computer Aided Design (CAD)