04 Progbasics
export uniform float sumall1(
   uniform int N,
   uniform float* x)
{
   uniform float sum = 0.0f;
   foreach (i = 0 ... N)
   {
      sum += x[i];
   }

   return sum;
}

export uniform float sumall2(
   uniform int N,
   uniform float* x)
{
   uniform float sum;
   float partial = 0.0f;
   foreach (i = 0 ... N)
   {
      partial += x[i];
   }

   // from ISPC math library
   sum = reduceAdd(partial);

   return sum;
}
sum is of type uniform float (one copy of the variable for all program instances).
In sumall1 this is undefined behavior: all program instances accumulate into sum in parallel, and the read-modify-write operation would have to be atomic for correctness (it is not).
ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel (sumall2, shown above):

▪ Each instance accumulates a private partial sum (no communication)

▪ Partial sums are added together using the reduceAdd() cross-instance communication primitive. The result is the same for all instances (uniform)
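For context, here is a minimal host-side C++ driver for sumall2 (a sketch, not from the slides). It assumes the ISPC compiler was asked to emit a header; the file name sumall_ispc.h is illustrative. When included from C++, ISPC-exported functions appear in the ispc namespace.

#include <cstdio>
#include "sumall_ispc.h"   // assumed name of the ISPC-generated header

int main() {
    const int N = 1024;
    float* x = new float[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0f;                     // populate x
    // One call runs the whole gang of program instances; the uniform result
    // comes back as an ordinary float.
    float sum = ispc::sumall2(N, x);
    printf("sum = %f\n", sum);           // expect 1024.0
    delete[] x;
    return 0;
}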
The same reduction written directly with AVX intrinsics in C++:

#include <immintrin.h>

const int N = 1024;
float* x = new float[N];
// populate x ...

// accumulate 8 partial sums, one per vector lane
__m256 partial = _mm256_setzero_ps();
for (int i = 0; i < N; i += 8)
   partial = _mm256_add_ps(partial, _mm256_loadu_ps(&x[i]));   // unaligned load: new[] does not guarantee 32-byte alignment

// reduce the 8 lanes to a single scalar sum
float lanes[8];
_mm256_storeu_ps(lanes, partial);
float sum = 0.f;
for (int i = 0; i < 8; i++)
   sum += lanes[i];
Speedup(P processors) = Time(1 processor) / Time(P processors)
** Other goals include efficiency (cost, area, power, etc.), working on bigger problems than on a uniprocessor
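As an illustrative worked example (numbers chosen here, not from the slides): if a program takes 10 seconds on 1 processor and 2.5 seconds on 8 processors, then Speedup(8 processors) = 10 / 2.5 = 4, well short of the ideal speedup of 8.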
Steps in creating a parallel program
These steps are performed by the programmer and/or the system (compiler, runtime, hardware):

Problem to solve
   → Decomposition → Subproblems (“tasks”)
   → Assignment → Threads **
   → Orchestration → Parallel program (communicating threads)
   → Mapping → Execution on parallel machine

** The textbook uses the term “processes”; we are referring to the same concept.
▪ Then speedup ≤ 1/S

▪ Example: a two-phase computation over an N×N grid of data

▪ Sequential implementation
   - Both phases take N² time: total is 2N²
   - [Figure: the sequential program runs at parallelism 1 for its entire execution time, the two N² phases back to back]
First attempt at parallelism (P processors)
▪ Strategy:
   - Phase 1: execute in parallel
      - time for phase 1: N²/P
   - Phase 2: execute serially
      - time for phase 2: N²

▪ Overall performance:
   - Total time: N²/P + N², versus 2N² for the sequential program
   - Even as P grows, the parallel time never drops below N², so Speedup ≤ 2
   - [Figure: sequential program (parallelism 1 for both N² phases) vs. parallel program (phase 1 at parallelism P for time N²/P, then phase 2 at parallelism 1 for time N²)]
Parallelizing phase 2
▪ Strategy:
   - Phase 1: execute in parallel
      - time for phase 1: N²/P
   - Phase 2: execute partial summations in parallel, combine results serially
      - time for phase 2: N²/P + P

▪ Overall performance:
   - Total time: 2N²/P + P, so Speedup = 2N² / (2N²/P + P), which approaches P when N >> P
   - Overhead: combining the partial sums (the extra P term)
   - [Figure: the parallel program runs both N²/P phases at parallelism P, followed by a short serial combine of length P]
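To make these formulas concrete, here is a small C++ sketch (illustrative numbers, not from the slides) that evaluates the speedup of both attempts for a few processor counts:

#include <cstdio>

int main() {
    const double N = 1024;                    // problem size (illustrative)
    const double seq = 2 * N * N;             // sequential time: two N^2 phases

    for (double P : {4.0, 16.0, 64.0, 256.0}) {
        double attempt1 = N * N / P + N * N;      // phase 2 still serial
        double attempt2 = 2 * N * N / P + P;      // phase 2 parallelized + combine
        printf("P = %4.0f   speedup (attempt 1) = %5.2f   speedup (attempt 2) = %7.2f\n",
               P, seq / attempt1, seq / attempt2);
    }
    return 0;
}

Attempt 1 saturates near 2, while attempt 2 tracks P closely as long as N is much larger than P.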
Amdahl’s law
▪ Let S = the fraction of total execution that is inherently sequential

▪ Max speedup on P processors is given by:

      speedup ≤ 1 / (S + (1 - S)/P)

   [Plot: max speedup vs. number of processors for S = 0.01, S = 0.05, and S = 0.1; each curve flattens out near 1/S]
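A quick sketch (illustrative) that evaluates Amdahl's bound for the S values shown in the plot:

#include <cstdio>

// Max speedup on P processors when a fraction S of execution is sequential.
double max_speedup(double S, double P) {
    return 1.0 / (S + (1.0 - S) / P);
}

int main() {
    for (double S : {0.01, 0.05, 0.10}) {
        printf("S = %.2f: P = 128 -> %5.1f   P -> infinity -> %5.0f\n",
               S, max_speedup(S, 128), 1.0 / S);
    }
    return 0;
}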
Decomposition
▪ Who is responsible for performing decomposition?
- In many cases: the programmer
Assignment
▪ Assigning tasks to threads
- Think of the threads as “workers”
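One common static assignment policy, sketched below in C++ (illustrative, not from the slides): give each worker thread a contiguous block of tasks, the “block assignment” that reappears in the grid example later.

#include <algorithm>
#include <thread>
#include <vector>

// Statically assign tasks [0, numTasks) to numWorkers threads in contiguous blocks.
void run_with_block_assignment(int numTasks, int numWorkers,
                               void (*doTask)(int taskId)) {
    std::vector<std::thread> workers;
    int blockSize = (numTasks + numWorkers - 1) / numWorkers;
    for (int w = 0; w < numWorkers; w++) {
        int begin = w * blockSize;
        int end = std::min(numTasks, begin + blockSize);
        workers.emplace_back([=]() {
            for (int t = begin; t < end; t++)
                doTask(t);                    // each worker processes its own block
        });
    }
    for (auto& th : workers)
        th.join();
}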
Orchestration
▪ Involves:
- Structuring communication
- Adding synchronization to preserve dependencies
- Organizing data structures in memory, scheduling tasks
Mapping
▪ Mapping “threads” to execution units
▪ Usually a job for the OS
▪ Many mapping decisions are trivial in parallel programs
- A parallel application uses the entire machine
- So oversubscribing the machine with multiple parallel apps is not common
Often, the reason a problem requires a lot of computation (and needs to be parallelized) is that it involves a lot of data. It is then often equally valid to think of partitioning the data (the computations go with the data).
But there are many computations where the correspondence between “tasks” and data is less clear; in these cases it is natural to think of partitioning the computation.
A parallel programming example
Update each cell of an N×N grid using its current value and the values of its four neighbors:

   A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j]
                   + A[i,j+1] + A[i+1,j]);

Possible strategy:
1. Partition grid cells on a diagonal into tasks
2. Update values in parallel
3. When complete, move to the next diagonal
Decomposition: tasks are individual elements

Assignment: specified explicitly (block assignment)

Orchestration: handled by the system (the end of the for_all block is an implicit wait for all workers before returning to sequential control)
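A sequential C++ sketch of this strategy (illustrative, not from the slides); the inner loop over one diagonal is the for_all that a parallel system would run concurrently, and finishing that loop is the implicit wait:

#include <algorithm>
#include <vector>

// One sweep over an (N+2) x (N+2) grid (one-cell boundary border), visiting the
// interior cells diagonal by diagonal. Cells on the same diagonal do not depend
// on each other, so each one is an independent task.
void diagonal_sweep(std::vector<float>& A, int N) {
    auto at = [&](int i, int j) -> float& { return A[i * (N + 2) + j]; };

    for (int d = 2; d <= 2 * N; d++) {                 // d = i + j selects a diagonal
        // for_all: every (i, j) on this diagonal is an independent task
        for (int i = std::max(1, d - N); i <= std::min(N, d - 1); i++) {
            int j = d - i;
            at(i, j) = 0.2f * (at(i, j) + at(i, j - 1) + at(i - 1, j)
                               + at(i, j + 1) + at(i + 1, j));
        }
        // implicit wait: advance to the next diagonal only after this one completes
    }
}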
Shared address space solver
SPMD execution model
Example: two program instances each add their partial sum (value 1, held in r2) to the shared variable diff, which starts at 0. One possible interleaving:

   T0                      T1
   r1 ← diff                                  (T0 reads value 0)
                           r1 ← diff          (T1 reads value 0)
   r1 ← r1 + r2                               (T0 sets value of its r1 to 1)
                           r1 ← r1 + r2       (T1 sets value of its r1 to 1)
   diff ← r1                                  (T0 stores 1 to diff)
                           diff ← r1          (T1 stores 1 to diff)

diff ends up 1 instead of the expected 2: the read-modify-write of diff is not atomic.
   T0                                   T1
   // produce x, then let T1 know
   X = 1;                               while (flag == 0);
   flag = 1;                            print X;
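On a real compiler and real hardware, plain shared variables are not enough for this pattern: the stores to X and flag can be reordered, and the spin loop can be hoisted. A minimal sketch of the same producer/consumer handshake using C++11 atomics (illustrative, not from the slides):

#include <atomic>
#include <cstdio>
#include <thread>

int X = 0;
std::atomic<int> flag{0};

void producer() {                                      // plays the role of T0
    X = 1;                                             // produce X
    flag.store(1, std::memory_order_release);          // then let T1 know
}

void consumer() {                                      // plays the role of T1
    while (flag.load(std::memory_order_acquire) == 0)
        ;                                              // spin until flag is set
    std::printf("%d\n", X);                            // prints 1
}

int main() {
    std::thread t1(consumer);
    std::thread t0(producer);
    t0.join();
    t1.join();
    return 0;
}

The release store on flag pairs with the acquire load in the spin loop, so the write to X is guaranteed to be visible before T1 reads it.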