Parallel Computing
Prepared by
Dr M.MANIMARAN
Senior Associate Professor (Grade-1)
School of Computing Science and Engineering
VIT Bhopal University
OBJECTIVES:
To introduce you to the basic concepts and ideas in parallel computing
To familiarize you with the major programming models in parallel computing
To provide you with guidance for designing efficient parallel programs
OUTLINE:
❑ What is High Performance Computing, and why use it?
❑ Parallel computer memory architectures
❑ Parallel programming models
❑ Designing parallel programs
❑ Parallel examples
What is High Performance Computing?
Parallel Computing:
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.
Parallel Computers:
Virtually all stand-alone computers today are parallel from a hardware perspective, with multiple cores and multiple hardware threads.
Parallel Computers:
Networks connect multiple stand-alone computers (nodes) to create larger parallel computer clusters.
Why Use HPC?
Major reasons:
Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.
Source: Top500.org
Future Trends:
Source: Top500.org
Parallel Computer Memory Architectures:
Shared Memory:
❑ Multiple processors can operate independently, but share the same memory resources
❑ Changes in a memory location caused by one CPU are visible to all processors
Advantages:
❑ Global address space provides a user-friendly programming perspective to memory
❑ Fast and uniform data sharing due to the proximity of memory to CPUs
Disadvantages:
❑ Lack of scalability between memory and CPUs: adding more CPUs increases traffic on the shared memory-CPU path
❑ The programmer is responsible for ensuring “correct” access to global memory
Parallel Computer Memory Architectures:
Distributed Memory:
❑ Requires a communication network to connect inter-processor memory
❑ Processors have their own local memory; changes made by one CPU have no effect on others
❑ Requires communication to exchange data among processors
Advantages:
❑ Memory is scalable with the number of CPUs
❑ Each CPU can rapidly access its own memory without the overhead of maintaining global cache coherency
Disadvantages:
❑ The programmer is responsible for many of the details associated with data communication between processors
❑ It is usually difficult to map existing data structures, based on global memory, to this memory organization
Parallel Computer Memory Architectures:
Hybrid Distributed-Shared Memory:
The largest and fastest computers in the world today employ both shared and distributed memory architectures.
Parallel Programming Models:
❑ Shared Memory / Threads
❑ Distributed Memory / Message Passing
❑ Data Parallel
❑ Hybrid
Shared Memory / Threads Models:
❑ POSIX Threads
❑ OpenMP
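As a minimal sketch (not taken from the slides), the following C program uses OpenMP to split a loop across the threads of a shared-memory node; the array size N and the per-element work are arbitrary choices for illustration:

#include <stdio.h>
#include <omp.h>

#define N 1000000                 /* arbitrary array size for the example */

static double a[N];

int main(void)
{
    /* Each thread handles a subset of the iterations; all threads see the
       same array a because memory is shared. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
    }

    printf("computed %d elements using up to %d threads\n", N, omp_get_max_threads());
    return 0;
}

Compile with an OpenMP-enabled compiler, for example gcc -fopenmp.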
Distributed Memory / Message Passing Models:
❑ A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines
❑ Message Passing Interface (MPI) is the "de facto" industry standard for message passing, having replaced virtually all other message passing implementations used for production work. MPI implementations exist for virtually all popular parallel computing platforms
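For illustration only (not from the slides), a minimal MPI program in C in which rank 0 sends a value to rank 1; each rank has its own copy of token in its local memory, so the data must be communicated explicitly:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        token = 42;                               /* arbitrary payload */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", token);
    }

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 2 ./a.out.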
Data Parallel Model:
❑ May also be referred to as the Partitioned Global Address Space (PGAS) model
❑ It displays these characteristics:
▪ Address space is treated globally
▪ Parallel work focuses on performing operations on a data set
▪ Tasks work on different portions of the same data structure
▪ Tasks perform the same operation on their portion of the data
Example implementations include Coarray Fortran and Unified Parallel C (UPC).
Hybrid Parallel Programming Models:
A common hybrid model combines MPI (for communication between nodes) with a threads model such as OpenMP (for computation within a node). This hybrid model lends itself well to the increasingly common hardware environment of clustered multi/many-core machines.
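A sketch of the MPI + OpenMP combination in C (illustrative, not from the slides); each MPI process spawns OpenMP threads on its node:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, provided;

    /* Request a thread support level suitable for using OpenMP inside MPI processes. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI: typically one process per node or socket; OpenMP: threads within it. */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}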
Hybrid Parallel Programming Models:
Another similar and increasingly popular example of a hybrid model is using MPI with GPU (Graphics Processing Unit) programming
❑ Communication between processes on different nodes occurs over the network using MPI
❑ Computationally intensive kernels on each node are offloaded to the GPU
Languages used for parallel computing:
❑ C/C++
❑ Fortran
❑ MATLAB
❑ Python
❑ R
❑ Perl
❑ Julia
❑ And others
Can my code be parallelized?
❑ Does it have large loops that repeat the same operations? (See the sketch below.)
▪ Would the amount of time it takes to parallelize your code be worth the gain in speed?
▪ Start from scratch: this takes longer, but gives better performance and accuracy, and offers the opportunity to turn a “black box” into code you understand
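As an illustration in C (not from the slides), the first loop below has independent iterations and is a good parallelization candidate, while the second carries a dependency between iterations:

#include <stddef.h>

/* Independent iterations: each a[i] depends only on i and b[i],
   so the loop can be split across threads or tasks. */
void parallel_friendly(double *a, const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * b[i] + c;
}

/* Loop-carried dependency: a[i] needs a[i-1], so the iterations
   cannot simply be distributed without rethinking the algorithm. */
void not_directly_parallel(double *a, const double *b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}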
Basic guidance for efficient parallelization:
❑ Increase the fraction of your program that can be parallelized. Identify the most time-consuming parts of your program and parallelize them. This could require modifying your underlying algorithm and your code's organization
Considerations about parallelization:
You parallelize your program to run faster and to solve larger, more complex problems.
Oversimplified example:
p = fraction of the program that can be parallelized
1 - p = fraction of the program that cannot be parallelized
n = number of processors
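The slide's formula is not reproduced in the text; with these definitions, the standard expression (Amdahl's law) for the run time and best-case speedup on n processors is

T(n) = \left[(1 - p) + \frac{p}{n}\right] T(1),
\qquad
S(n) = \frac{T(1)}{T(n)} = \frac{1}{(1 - p) + p/n} \;\longrightarrow\; \frac{1}{1 - p} \quad \text{as } n \to \infty,

so the serial fraction 1 - p limits the achievable speedup no matter how many processors are used.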
Oversimplified example, cont'd:
[Figure: a program that is 20% serial and 80% parallelizable; in the parallel version the 80% is split across four processes while the 20% remains serial.]
More realistic example:
[Figure: the same 20%/80% program; in practice the parallel portion also suffers communication overhead and load unbalance among the four processes.]
Realistic example: speedup of matrix-vector multiplication in large-scale shell-model calculations
Designing parallel programs - partitioning:
One of the first steps in designing a parallel program is to break the problem into discrete “chunks” that can be distributed to multiple parallel tasks.
Domain Decomposition:
The data associated with the problem is partitioned: each parallel task works on a portion of the data.
Functional Decomposition:
The problem is decomposed according to the work that must be done. Each parallel task performs a fraction of the total computation.
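To make domain decomposition concrete, here is a small C sketch (illustrative names, not from the slides) of how each task can compute the contiguous block of a global index range that it owns:

#include <stdio.h>

/* Block domain decomposition: task `rank` of `ntasks` gets a contiguous
   chunk of the global index range [0, n). Lower-numbered tasks absorb the
   remainder when n is not divisible by ntasks. */
void my_block(long n, int ntasks, int rank, long *start, long *end)
{
    long base = n / ntasks;
    long rem  = n % ntasks;
    *start = rank * base + (rank < rem ? rank : rem);
    *end   = *start + base + (rank < rem ? 1 : 0);   /* exclusive */
}

int main(void)
{
    long start, end;
    for (int rank = 0; rank < 4; rank++) {
        my_block(10, 4, rank, &start, &end);
        printf("task %d owns indices [%ld, %ld)\n", rank, start, end);
    }
    return 0;
}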
Designing parallel programs - communication:
Most parallel applications require tasks to share data with each other.
Latency vs. Bandwidth: Latency is the time it takes to send a minimal message between two tasks. Bandwidth is the amount of data that can be communicated per unit of time. Sending many small messages can cause latency to dominate communication overhead, so it is often more efficient to package small messages into a larger one (see the sketch below).
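A C/MPI fragment illustrating that point (not from the slides): the first routine pays the latency cost n times, the second pays it once for the same data:

#include <mpi.h>

/* Sends n separate one-element messages: each one pays the latency cost. */
void send_many_small(const double *v, int n, int dest)
{
    for (int i = 0; i < n; i++)
        MPI_Send(&v[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* Sends the same data as a single message: latency is paid only once. */
void send_one_large(const double *v, int n, int dest)
{
    MPI_Send(v, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}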
Designing parallel programs – load balancing:
Load balancing is the practice of distributing approximately equal amounts of work so that all tasks are kept busy all the time.
Equally partition the work given to each task: For array/matrix operations, distribute the data set equally among the parallel tasks. For loop iterations where the work done in each iteration is similar, distribute the iterations evenly among the tasks.
Use dynamic work assignment: Certain classes of problems result in load imbalance even if data is distributed evenly among tasks (sparse matrices, adaptive grid methods, many-body simulations, etc.). Use a scheduler / task-pool approach: as each task finishes its piece of work, it queues to get a new one (see the sketch below). Modify your algorithm to handle imbalances dynamically.
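One way to get dynamic work assignment in a shared-memory code is OpenMP's dynamic schedule, which hands loop iterations out from a shared pool as threads become free. A C sketch (the cost function is an arbitrary stand-in for uneven work):

#include <stdio.h>
#include <omp.h>

/* Stand-in for a work item whose cost varies widely between iterations. */
static double expensive(int i)
{
    double x = 0.0;
    for (int k = 0; k < (i % 17 + 1) * 100000; k++)
        x += 1.0 / (k + 1.0);
    return x;
}

int main(void)
{
    const int n = 256;          /* arbitrary number of work items */
    double total = 0.0;

    /* schedule(dynamic): iterations are handed out from a shared pool as
       threads finish, which keeps all threads busy despite uneven work. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += expensive(i);

    printf("total = %f\n", total);
    return 0;
}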
Designing parallel programs – I/O:
The Bad News:
❑ I/O operations are inhibitors of parallelism
❑ I/O operations are orders of magnitude slower than memory operations
❑ Parallel file systems may be immature or not available on all systems
❑ I/O that must be conducted over a network can cause severe bottlenecks
I/O Tips:
❑ Reduce overall I/O as much as possible
❑ If you have access to a parallel file system, use it
❑ Writing large chunks of data rather than small ones is significantly more efficient
❑ Fewer, larger files perform much better than many small files
❑ Have a subset of the parallel tasks perform the I/O instead of using all tasks, or
❑ Confine I/O to a single task and then broadcast (gather) data to (from) the other tasks, as sketched below
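A C/MPI sketch of the single-writer pattern just mentioned (illustrative; the file name, data, and sizes are arbitrary): every task computes its data, and rank 0 gathers it and performs all file I/O:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NLOCAL 1000   /* arbitrary number of values each task produces */

int main(int argc, char **argv)
{
    int rank, size;
    double local[NLOCAL], *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NLOCAL; i++)          /* each task computes its data */
        local[i] = rank * NLOCAL + i;

    if (rank == 0)
        all = malloc((size_t)size * NLOCAL * sizeof(double));

    /* Gather everything to rank 0, which does all the file I/O. */
    MPI_Gather(local, NLOCAL, MPI_DOUBLE,
               all,   NLOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *f = fopen("results.bin", "wb");   /* hypothetical output file */
        fwrite(all, sizeof(double), (size_t)size * NLOCAL, f);
        fclose(f);
        free(all);
    }

    MPI_Finalize();
    return 0;
}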
Example – array processing:
Worker section of a master/worker scheme (pseudocode):
if I am WORKER
  # process my portion of the array
  do i = mystart, myend
    a(i) = fcn(i)
  end do
  send MASTER results
end if
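A runnable counterpart to the pseudocode above, sketched in C with MPI; fcn, the array size, and the block distribution are illustrative choices rather than anything specified in the slides:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000000                 /* illustrative global array size */

static double fcn(long i) { return 2.0 * (double)i; }   /* placeholder work */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each task computes its own contiguous portion of the array. */
    long chunk   = N / size;
    long mystart = rank * chunk;
    long myend   = (rank == size - 1) ? N : mystart + chunk;  /* last task takes the remainder */

    double *a = malloc((size_t)(myend - mystart) * sizeof(double));
    for (long i = mystart; i < myend; i++)
        a[i - mystart] = fcn(i);

    /* Workers send their results to the MASTER (rank 0). */
    if (rank != 0) {
        MPI_Send(a, (int)(myend - mystart), MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            long s = src * chunk;
            long e = (src == size - 1) ? N : s + chunk;
            double *buf = malloc((size_t)(e - s) * sizeof(double));
            MPI_Recv(buf, (int)(e - s), MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* ... use or write out buf here ... */
            free(buf);
        }
        printf("master received results from %d workers\n", size - 1);
    }

    free(a);
    MPI_Finalize();
    return 0;
}

A production code would more likely collect the results with MPI_Gatherv or write them with parallel I/O, but the structure above mirrors the master/worker pseudocode.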