
Worksharing and Parallel Loops

Intended Learning Outcomes

• Know the difference between SPMD and worksharing
• Know how to express worksharing in OpenMP
• Use a parallel loop in the pi application
• Know about loop scheduling and try the different schedules
against the STREAM benchmark
• Know about OpenMP sections
SPMD vs Worksharing

A parallel construct by itself creates a Single Program Multiple
Data (SPMD) program, i.e., each thread executes the same
code on different data. How do we split up pathways through
the code, so that threads work in different regions or on
different data?

This is called worksharing:


• Loop construct
• sections construct
• single construct
• task construct (available since OpenMP 3.0)
Worksharing

Each thread is assigned an independent subset of the
total workload.
Worksharing
For example, different chunks of the iteration space are
distributed among the threads.

[Figure: the 16 iterations i=0..15 are divided among four
threads; an implicit barrier follows the construct.]
OpenMP loop worksharing construct

OpenMP’s loop worksharing construct splits the loop iterations
among all active threads.

  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<N; i++) {
      foo(i);
    }
  }

Loop construct name:
• C/C++: for
• Fortran: do

The variable i is made “private” to each thread by default.
You could do this explicitly with a “private(i)” clause.
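
For reference, a complete compilable version of this pattern might
look like the sketch below (the array, its size N and the printout
are illustrative choices, not part of the slide; build with
something like gcc -fopenmp):

  #include <stdio.h>
  #include <omp.h>

  #define N 16

  int main(void) {
      int a[N];
      int i;

      #pragma omp parallel
      {
          /* The for construct splits the N iterations among the
             threads; i is private to each thread by default. */
          #pragma omp for
          for (i = 0; i < N; i++) {
              a[i] = omp_get_thread_num();
          }
      }   /* implicit barrier at the end of the parallel region */

      for (i = 0; i < N; i++)
          printf("iteration %d was run by thread %d\n", i, a[i]);
      return 0;
  }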
Worksharing – a motivating example

Sequential code:

  for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (SPMD):

  #pragma omp parallel
  {
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    if (id == Nthrds-1) iend = N;
    for (i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
  }

OpenMP parallel region and a worksharing for construct:

  #pragma omp parallel
  #pragma omp for
  for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }

By default, there is a barrier at the end of the parallel loop.
Use the nowait clause to turn off the barrier.
Combined OpenMP constructs

OpenMP allows a combined parallel and worksharing
directive on the same line.

  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<MAX; i++) {
      res[i] = huge();
    }
  }

  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }

These two forms are equivalent.

However, for performance reasons one should aim at making
the parallel regions as large as possible. Why?
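
One consequence of this advice, sketched below: two consecutive
loops can share a single parallel region instead of forking and
joining the thread team twice (the arrays and their values are
illustrative choices, not part of the slide):

  #include <stdio.h>

  #define N 1000

  int main(void) {
      static double a[N], b[N], c[N];

      for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }

      /* One parallel region enclosing two worksharing loops: the
         thread team is created once, not once per loop. */
      #pragma omp parallel
      {
          #pragma omp for
          for (int i = 0; i < N; i++)
              a[i] = a[i] + b[i];

          /* implicit barrier here: a[] is complete before it is read */
          #pragma omp for
          for (int i = 0; i < N; i++)
              c[i] = 2.0 * a[i];
      }

      printf("c[N-1] = %f\n", c[N-1]);
      return 0;
  }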
Working with loops

Basic approach:
• Find compute-intensive loops (use a profiler!)
• Make the loop iterations independent, so they can safely
execute in any order without loop-carried dependencies
• Place the appropriate OpenMP directive and test
Where is the bug?

  int i, j, A[MAX];
  j = 5;
  for (i=0; i<MAX; i++) {
    j += 2;
    A[i] = big(j);
  }

The loop-carried dependence on j is removed by computing j
directly from i, which makes the iterations independent:

  int i, A[MAX];
  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    int j = 5 + 2*(i+1);
    A[i] = big(j);
  }
OpenMP collapse clause

For perfectly nested rectangular loops we can parallelize
multiple loops in the nest with the collapse clause:

  #pragma omp parallel for collapse(2)
  for (int i1=0; i1<n1; i1++) {
    for (int i2=0; i2<n2; i2++) {
      .....
    }
  }

Conceptually, this corresponds to a single hand-written loop over
a combined index, with i1 and i2 recovered from it:

  i1=0; i2=0;
  for (int i1i2=0; i1i2<n1*n2; i1i2++) {
    .....
    i2++;
    if (i2 == n2) {
      i2 = 0;
      i1++;
    }
  }

• The collapsed loop forms a single iteration space of n1*n2
iterations, which is then split among the threads.
Reductions

An OpenMP reduction is used to generate code for recurrence
calculations (with associative and commutative operators) so
that they can be performed in parallel.
• Very common in numerical methods, e.g. computing
averages, norms or finding min/max

  double avg=0.0;
  double A[MAX];
  for (int i=0; i<MAX; i++) {
    avg += A[i];
  }
  avg /= MAX;

• Supported in most parallel programming environments
Reduction in OpenMP

OpenMP reduction clause:

  reduction (op : list)

Inside a parallel or a worksharing construct:
• A local copy of each variable in the list is created and
initialized depending on the “op” (e.g. 0 for “+”).
• Updates are made to the local copy.
• Local copies are reduced into a single value and combined
with the original global value.

The variables in “list” must be shared in the enclosing
parallel region.

  double avg=0.0;
  double A[MAX];
  #pragma omp parallel for reduction(+:avg)
  for (int i=0; i<MAX; i++) {
    avg += A[i];
  }
  avg /= MAX;
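
Since min/max were mentioned as common reduction targets, here is a
sketch using the max reduction operator (available in OpenMP 3.1 and
later); the array and its fill values are made up for illustration:

  #include <stdio.h>

  #define MAX 1000

  int main(void) {
      double A[MAX];
      for (int i = 0; i < MAX; i++)
          A[i] = (double)(i % 37);          /* arbitrary test data */

      double maxval = A[0];
      /* Each thread keeps a private maximum; the private maxima are
         combined into maxval when the loop finishes. */
      #pragma omp parallel for reduction(max:maxval)
      for (int i = 0; i < MAX; i++) {
          if (A[i] > maxval) maxval = A[i];
      }

      printf("max = %f\n", maxval);
      return 0;
  }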
Exercise 2.3

• Go back to the serial pi.c program and parallelize it with a
loop construct
• Your goal is to minimize the number of changes made to
the serial program. For instance, you can use the reduction
clause (one possible shape of the solution is sketched below).
• How does it perform?
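
A hedged sketch of what the parallel version could look like,
assuming pi.c is the usual numerical-integration program (the step
count is an arbitrary choice and the variable names may differ from
the course's file):

  #include <stdio.h>

  static long num_steps = 100000000;

  int main(void) {
      double step = 1.0 / (double)num_steps;
      double sum = 0.0;

      /* The only change to the serial code is this pragma: the loop
         construct shares the iterations and the reduction combines
         the per-thread partial sums. */
      #pragma omp parallel for reduction(+:sum)
      for (long i = 0; i < num_steps; i++) {
          double x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
      }

      printf("pi = %.15f\n", step * sum);
      return 0;
  }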
OpenMP Loop Scheduling

The OpenMP runtime decides how the loop iterations are
distributed among the threads; this is called scheduling.
OpenMP defines the following choices of loop scheduling:
• Static – predetermined at compile time. Lowest overhead,
predictable
• Dynamic – assignment is made at runtime
• Guided – a special case of dynamic; attempts to reduce
overhead
• Auto – the runtime can “learn” from previous
executions of the same loop
The schedule clause

The schedule clause affects how loop iterations are
mapped onto threads:
• schedule(static [,chunk])
– Deal out blocks of iterations of size “chunk” to each thread.
• schedule(dynamic [,chunk])
– Each thread grabs “chunk” iterations off a queue until all iterations
have been handled.
• schedule(guided [,chunk])
– Threads dynamically grab blocks of iterations. The size of the block
starts large and shrinks down to size “chunk” as the calculation
proceeds.
• schedule(runtime)
– Schedule and chunk size are taken from the OMP_SCHEDULE
environment variable (or from the runtime library in OpenMP 3.0).
• schedule(auto)
– The schedule is up to the runtime to choose (it does not have to be
any of the above).
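
To make the syntax concrete, a small self-contained sketch follows;
the loop bound, the chunk sizes and the artificial work() routine
are illustrative choices, not part of the slide:

  #include <stdio.h>
  #include <unistd.h>   /* usleep */

  #define N 100

  /* Placeholder workload whose cost grows with i, so the different
     schedules behave differently. */
  static void work(int i) { usleep(100 * i); }

  int main(void) {
      /* Fixed chunks of 8 iterations, dealt out round-robin. */
      #pragma omp parallel for schedule(static, 8)
      for (int i = 0; i < N; i++) work(i);

      /* Threads grab chunks of 4 iterations from a shared queue. */
      #pragma omp parallel for schedule(dynamic, 4)
      for (int i = 0; i < N; i++) work(i);

      /* Schedule chosen at run time from OMP_SCHEDULE. */
      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < N; i++) work(i);

      printf("done\n");
      return 0;
  }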
The schedule clause

Simple example to demonstrate the different strategies.

schedule.f90:

  program schedule
    use omp_lib
    use, intrinsic :: iso_c_binding
    integer :: i
    integer, parameter :: n = 1000
    integer :: buffer(n)

    interface
       subroutine usleep(u) bind(c)
         use, intrinsic :: iso_c_binding
         integer(kind=c_long), value :: u
       end subroutine usleep
    end interface

    !$omp parallel do schedule(runtime)
    do i = 1,n
       buffer(i) = omp_get_thread_num()
       call usleep(int(rand(buffer(i))*2000, kind=c_long))
    end do
    !$omp end parallel do

    open(1, file="schedule.dat")
    do i = 1, n
       write(1, *) i, buffer(i)
    end do
    close(1)
  end program schedule

schedule.gp:

  set terminal png
  set autoscale
  set output "schedule.png"
  set xlabel "Iteration"
  set ylabel "Thread ID"
  unset key
  plot "schedule.dat" using 1:2 ls 4

Build and run:

  > gfortran -fopenmp schedule.f90
  > setenv OMP_SCHEDULE static
  > ./a.out
  > gnuplot schedule.gp

Now check schedule.png.
The schedule clause

[Figure: schedule.png for three runs, showing thread ID vs.
iteration under static, dynamic and guided scheduling.]
Why different scheduling algorithms

Static scheduling works well in most situations, with threads
running almost in sync and where the cost of each iteration is
more or less constant.

Other scheduling algorithms are mainly used to address issues
when iteration costs vary or threads run out of sync.

Consider the example below: what will happen if static
scheduling is used?

  #pragma omp single nowait
  {
    serial_work()
  }
  parallel_work()

  serial_work()
  {
    Do some work
  }

  parallel_work()
  {
    #pragma omp for nowait schedule()
    {
      …
    }
    #pragma omp for schedule()
    {
    }
    #pragma omp for schedule()
    {
    }
  }
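
A compilable miniature of that pattern is sketched below; the
serial_work routine, the sleep times and the array are invented for
illustration. With schedule(static) the thread busy in serial_work
leaves its pre-assigned chunk waiting, while schedule(dynamic) lets
the other threads absorb those iterations:

  #include <stdio.h>
  #include <unistd.h>   /* usleep */
  #include <omp.h>

  #define N 64

  static void serial_work(void) {
      usleep(200000);   /* pretend 200 ms of setup done by one thread */
  }

  int main(void) {
      int owner[N];

      #pragma omp parallel
      {
          /* One thread runs the serial part; nowait lets the others
             continue straight into the loop below. */
          #pragma omp single nowait
          serial_work();

          /* Try schedule(static) here and compare the ownership map. */
          #pragma omp for schedule(dynamic)
          for (int i = 0; i < N; i++) {
              usleep(10000);                /* pretend 10 ms of work */
              owner[i] = omp_get_thread_num();
          }
      }

      for (int i = 0; i < N; i++)
          printf("i=%d thread=%d\n", i, owner[i]);
      return 0;
  }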
Exercise 3 – STREAM copy

STREAM is a widely used benchmark to measure the memory
bandwidth of a system. You will use it to study the effect of
multiple threads on memory bandwidth for a Beskow node.
• Using OpenMP in STREAM COPY

  #pragma omp parallel for
  for (j=0; j<STREAM_ARRAY_SIZE; j++)
    c[j] = a[j];

• Running STREAM

  export OMP_NUM_THREADS=4
  ./stream
Exercise 3 – Possible Issues

• How are threads in STREAM assigned to cores in the node?
• There are two processor chips in the node (Beskow).
• Each chip introduces a separate bandwidth limit.
• How are threads distributed across the cores?
• Are these measurements repeatable?
• The STREAM code makes no effort to get repeatable
results.
Exercise 3 – STREAM and loop scheduling

• STREAM as distributed uses the default (static) schedule
• Static is best when the loop limits are known, the work per
iteration is constant, and the cores are used only by the
application
• Question: Are all of those assumptions correct?
Exercise 3 – STREAM and loop scheduling

Question: Are all of those assumptions correct?
• The last one (cores used only by the application) is mostly
unrealistic

Try running STREAM with one thread per available core and
each of the schedules (a way to switch schedules without
recompiling is sketched below):
• Static
• Dynamic
• Guided
• How do they perform?
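
One option is to change the copy kernel to schedule(runtime) and
pick the schedule through OMP_SCHEDULE. The sketch below is a
self-contained miniature of the copy kernel, not the actual STREAM
source; the array size is an arbitrary stand-in and the bandwidth
formula follows STREAM's convention of one read plus one write per
element:

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  #define ARRAY_SIZE 10000000   /* stand-in for STREAM_ARRAY_SIZE */

  int main(void) {
      double *a = malloc(ARRAY_SIZE * sizeof *a);
      double *c = malloc(ARRAY_SIZE * sizeof *c);
      for (long j = 0; j < ARRAY_SIZE; j++) a[j] = 1.0;

      double t = omp_get_wtime();
      /* schedule(runtime) reads the schedule from OMP_SCHEDULE, so
         static, dynamic and guided can be compared without rebuilding. */
      #pragma omp parallel for schedule(runtime)
      for (long j = 0; j < ARRAY_SIZE; j++)
          c[j] = a[j];
      t = omp_get_wtime() - t;

      printf("copy: %.3f GB/s\n",
             2.0 * ARRAY_SIZE * sizeof(double) / t / 1e9);
      free(a); free(c);
      return 0;
  }

Set, for example, export OMP_SCHEDULE=guided (or static, dynamic)
before running, as with the Fortran demo earlier.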
Sections worksharing construct

The sections worksharing construct assigns a different block of
code to each thread.

  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      x_calculation();
      #pragma omp section
      y_calculation();
      #pragma omp section
      z_calculation();
    }
  }

By default, there is a barrier at the end of the omp sections.
Use the nowait clause to turn off the barrier.
Worksharing with tasks
New construct since OpenMP 3.0 to express task parallelism.

Useful for problems where the parallelism cannot be extracted
from loops, e.g. recursive or irregular algorithms (particle
tracking).

  double parallel_sum(const double *a, int n);

  double sum(const double *a, int n) {
      double res;
      #pragma omp parallel
      #pragma omp single nowait
      res = parallel_sum(a, n);
      return res;
  }

  double parallel_sum(const double *a, int n) {
      if (n <= 1)                     /* base case of the recursion */
          return (n == 1) ? a[0] : 0.0;
      int h = n / 2;
      double x, y;
      #pragma omp task shared(x)
      x = parallel_sum(a, h);
      #pragma omp task shared(y)
      y = parallel_sum(a + h, n - h);
      #pragma omp taskwait
      x += y;
      return x;
  }
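
A possible way to exercise the sketch above (the array length and
fill values are arbitrary; in practice a cutoff that switches to a
serial loop for small n would keep the task overhead down):

  #include <stdio.h>
  #include <stdlib.h>

  double sum(const double *a, int n);   /* from the sketch above */

  int main(void) {
      int n = 1 << 20;
      double *a = malloc(n * sizeof *a);
      for (int i = 0; i < n; i++)
          a[i] = 1.0;                    /* expected total: n */

      printf("sum = %f (expected %d)\n", sum(a, n), n);
      free(a);
      return 0;
  }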
