Introduction To Parallel Programming
Linda Woodard
CAC
19 May 2010
What is Parallel Programming?
Using more than one processor or computer to complete a task
Forest Inventory with Ranger Bob
Why Do Parallel Programming?
• Limits of single CPU computing
– performance
– available memory
• We can solve…
– larger problems
– faster
– more cases
Forest Inventory with Ranger Bob
Terminology (1)
• serial code is a single thread of execution working on a single data item at any
one time
• parallel code has more than one thing happening at a time. This could be
– A single thread of execution operating on multiple data items simultaneously
– Multiple threads of execution in a single executable
– Multiple executables all working on the same problem
– Any combination of the above
• task is the name we use for an instance of an executable. Each task has its own
virtual address space and may have multiple threads.
Terminology (2)
• node: a discrete unit of a computer system that typically runs its own instance
of the operating system
• grid: the software stack designed to handle the technical and social challenges
of sharing resources across network and institutional boundaries. The term also
applies to the groups that have agreed to share their resources.
Limits of Parallel Computing
• Theoretical Upper Limits
– Amdahl’s Law
• Practical Limits
– Load balancing (waiting)
– Conflicts (accesses to shared memory)
– Communications
– I/O (file system access)
Theoretical Upper Limits to Performance
• All parallel programs contain:
– parallel sections (we hope!)
– serial sections (unfortunately)
[Figure: serial and parallel sections of a run, illustrated with 4 tasks]
Amdahl’s Law
• Amdahl’s Law places a strict limit on the speedup that can be
realized by using multiple processors.
– Effect of multiple processors on run time
tN = (fp / N + fs) t1
– where
• fs = serial fraction of code
• fp = parallel fraction of code
• N = number of processors
• t1 = time to run on one processor
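For example, with illustrative values fs = 0.1, fp = 0.9, and N = 8:
tN = (0.9/8 + 0.1) t1 ≈ 0.21 t1, i.e., a speedup of roughly 4.7.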
Limit Cases of Amdahl’s Law
• Speedup formula:
S = 1 / (fs + fp / N)
where
• fs = serial fraction of code
• fp = parallel fraction of code
• N = number of processors
Cases:
1. fs = 0, fp = 1: S = N
2. N → infinity: S → 1/fs; if 10% of the code is serial, you will never speed up
by more than a factor of 10, no matter how many processors you use (the sketch
below makes this limit concrete).
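A minimal C sketch that simply evaluates this formula for the 10%-serial case
(fs = 0.10, fp = 0.90; illustrative values):

#include <stdio.h>

/* Evaluate Amdahl's Law speedup S = 1 / (fs + fp/N)
   for a code that is 10% serial. */
int main(void)
{
    const double fs = 0.10, fp = 0.90;
    int n;

    for (n = 1; n <= 1024; n *= 4)
        printf("N = %4d   S = %5.2f\n", n, 1.0 / (fs + fp / n));

    /* S climbs toward, but never exceeds, 1/fs = 10. */
    return 0;
}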
Illustration of Amdahl’s Law
[Plot: speedup S vs. number of processors (0 to 250) for fp = 1.000, 0.999, 0.990, and 0.900]
More Terminology
• synchronization: the temporal coordination of parallel tasks. It involves
waiting until two or more tasks reach a specified point (a sync point) before
continuing any of the tasks.
Practical Limits: Amdahl’s Law vs. Reality
• Amdahl’s Law shows a theoretical upper limit on parallel speedup
• In reality, the situation is even worse than predicted by Amdahl’s Law due to:
– Load balancing (waiting)
– Scheduling (shared processors or memory)
– Communications
– I/O
[Plot: speedup S vs. number of processors (0 to 250) for fp = 0.99, comparing Amdahl’s Law with reality]
Forest Inventory with Ranger Bob
Is it really worth it to go Parallel?
• Writing effective parallel applications is difficult!!
– Load balance is important
– Communication can limit parallel efficiency
– Serial time can dominate
Types of Parallel Computers (Flynn's taxonomy)
                        Data Stream
                        Single                     Multiple
Instruction   Single    SISD                       SIMD
Stream                  (Single Instruction,       (Single Instruction,
                        Single Data)               Multiple Data)
              Multiple  MISD                       MIMD
                        (Multiple Instruction,     (Multiple Instruction,
                        Single Data)               Multiple Data)
Types of Parallel Computers (Memory Model)
• Nearly all parallel machines these days are multiple instruction, multiple
data (MIMD)
Shared and Distributed Memory Models
[Diagram: shared memory, with processors (P) connected to one Memory over a Bus; distributed memory, with processor/memory (P/M) pairs connected by a Network]

Shared memory: single address space. All processors have access to a pool of shared
memory; easy to build and program, good price-performance for small numbers of
processors; predictable performance due to UMA. (example: SGI Altix)
Methods of memory access:
- Bus
- Crossbar

Distributed memory: each processor has its own local memory. Must do message
passing to exchange data between processors. cc-NUMA enables larger number of
processors and shared memory address space than SMPs; still easy to program, but
harder and more expensive to build. (example: Clusters)
Methods of memory access:
- various topological interconnects
Ranger
Shared Memory vs. Distributed Memory
• Tools can be developed to make one kind of system appear to be a different
kind of system
– distributed memory systems can be programmed as if they have shared
memory, and vice versa
– such tools do not produce the most efficient code, but might enable
portability
Programming Parallel Computers
• Programming single-processor systems is (relatively) easy because
they have a single thread of execution and a single address space.
Data Parallel Programming Example
One code will run on 2 CPUs.
Program has an array of data to be operated on by 2 CPUs, so the array is split in two.

program:
   ...
   if CPU=a then
      low_limit=1
      upper_limit=50
   elseif CPU=b then
      low_limit=51
      upper_limit=100
   end if
   do I = low_limit, upper_limit
      work on A(I)
   end do
   ...
end program

On CPU A the code runs with low_limit=1 and upper_limit=50; on CPU B it runs with
low_limit=51 and upper_limit=100, so each CPU loops over its own half of A.
Forest Inventory with Ranger Bob
Single Program, Multiple Data (SPMD)
SPMD: dominant programming model for shared and distributed
memory machines.
– One source code is written
– Code can have conditional execution based on which processor is
executing the copy
– All copies of code are started simultaneously and communicate and
sync with each other periodically
SPMD Programming Model
[Diagram: a single source file, source.c, built into one executable that runs as multiple tasks]
Forest Inventory with Ranger Bob
Shared Memory Programming: OpenMP
• Shared memory systems (SMPs and cc-NUMAs) have a single
address space:
– applications can be developed in which loop iterations (with no
dependencies) are executed by different processors
– shared memory codes are mostly data parallel, ‘SIMD’ kinds of codes
– OpenMP is the new standard for shared memory programming
(compiler directives)
– Vendors offer native compiler directives
Accessing Shared Variables
• If multiple processors want to write to a shared variable at the same time,
there could be conflicts:
– Process 1 and 2
– read X
– compute X+1
– write X
• Programmer, language, and/or architecture must provide ways of resolving
conflicts
[Diagram: shared variable X in memory, with X + 1 computed in proc1 and in proc2]
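One common way to resolve such a conflict in a shared memory code is an atomic
update; a minimal C/OpenMP sketch (illustrative, using a hypothetical loop of
1000 increments):

#include <stdio.h>

int main(void)
{
    int x = 0;     /* the shared variable X */
    int i;

    /* Every thread performs the read / compute X+1 / write sequence.
       The atomic directive makes each update indivisible,
       so no increments are lost. */
    #pragma omp parallel for shared(x) private(i)
    for (i = 0; i < 1000; i++) {
        #pragma omp atomic
        x = x + 1;
    }

    printf("x = %d\n", x);   /* always 1000 with the atomic update */
    return 0;
}

Build with an OpenMP-aware compiler (e.g., gcc -fopenmp); without OpenMP the
pragmas are ignored and the loop simply runs serially.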
OpenMP Example #1: Parallel Loop
!$OMP PARALLEL DO
do i=1,128
b(i) = a(i) + c(i)
end do
!$OMP END PARALLEL DO
The first directive specifies that the loop immediately following should be
executed in parallel. The second directive specifies the end of the parallel
section (optional).
For codes that spend the majority of their time executing the content of
simple loops, the PARALLEL DO directive can result in significant parallel
performance.
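For comparison, a rough C equivalent of the same loop (a sketch; the function
name and array arguments are assumed, and the Fortran range 1..128 becomes
0..127):

void vec_add(double *b, const double *a, const double *c)
{
    int i;

    /* Iterations are divided among the threads, just as the
       PARALLEL DO directive divides them in the Fortran version. */
    #pragma omp parallel for
    for (i = 0; i < 128; i++)
        b[i] = a[i] + c[i];
}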
OpenMP Example #2: Private Variables
!$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I,TEMP)
do I=1,N
TEMP = A(I)/B(I)
C(I) = TEMP + SQRT(TEMP)
end do
!$OMP END PARALLEL DO
In this loop, each processor needs its own private copy of the variable
TEMP. If TEMP were shared, the result would be unpredictable since
multiple processors would be writing to the same memory location.
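A rough C equivalent (a sketch; the function name and arguments are assumed)
uses the same SHARED and PRIVATE clauses, so each thread again gets its own
copy of temp:

#include <math.h>

void scale(int n, const double *a, const double *b, double *c)
{
    int i;
    double temp;

    /* a, b, c, and n are shared; i and temp are private to each thread. */
    #pragma omp parallel for shared(a, b, c, n) private(i, temp)
    for (i = 0; i < n; i++) {
        temp = a[i] / b[i];
        c[i] = temp + sqrt(temp);
    }
}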
Distributed Memory Programming: MPI
Distributed memory systems have separate address spaces for
each processor
Data Decomposition
For distributed memory systems, the ‘whole’ grid or sum of particles
is decomposed across the individual nodes
– Each node works on its section of the problem
– Nodes can exchange information
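For example, a minimal sketch (the function and variable names here are made up
for illustration) of how each task can compute its own block of a 1-D problem
with n points:

/* Block decomposition of n points over 'size' tasks.
   Task 'rank' gets the half-open range [*lo, *hi);
   the first n % size tasks get one extra point. */
void local_range(int n, int rank, int size, int *lo, int *hi)
{
    int chunk = n / size;
    int rem   = n % size;

    *lo = rank * chunk + (rank < rem ? rank : rem);
    *hi = *lo + chunk + (rank < rem ? 1 : 0);
}

Each task then loops from *lo to *hi - 1 and uses message passing (see the MPI
example that follows) to exchange information with other tasks.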
MPI Example
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    char message[20];
    int i, rank, size, type = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id: 0..size-1 */

    if (rank == 0) {
        /* task 0 sends the 13-character message to every other task */
        strcpy(message, "Hello, world");
        for (i = 1; i < size; i++)
            MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
    }
    else
        MPI_Recv(message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);

    printf("Message from process = %d : %.13s\n", rank, message);

    MPI_Finalize();
    return 0;
}
MPI: Sends and Receives
MPI programs must send and receive data between the processors
(communication)
The most basic calls in MPI (besides the initialization, rank/size, and
finalization calls) are:
– MPI_Send
– MPI_Recv
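For reference, the prototypes for these two calls (as declared in mpi.h):

/* buf: address of the data; count and datatype: how much and what kind;
   dest/source: rank of the partner task; tag: message label;
   comm: communicator (e.g., MPI_COMM_WORLD); status: info about the received message */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);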
Programming Multi-tiered Systems
• Systems with multiple shared memory nodes are becoming common
for reasons of economics and engineering.