
Worksharing and Parallel Loops

Intended Learning Outcomes

• Know the difference between SPMD and worksharing
• Know how to express worksharing in OpenMP
• Use a parallel loop in the pi application
• Know about loop scheduling and try the different schedules
against the STREAM benchmark
• Know about OpenMP sections
SPMD vs Worksharing

A parallel construct by itself creates a Single Program Multiple
Data (SPMD) program, i.e., each thread executes the same
code on different data. How do we split up pathways through
the code, so that threads work in different regions or on
different data?

This is called worksharing:


• Loop construct
• sections construct
• single construct
• task construct (available since OpenMP 3.0)
Worksharing

Each thread is assigned an independent subset of the
total workload.
Worksharing
For example, different chunks of the iteration space are
distributed among the threads.

[Figure: the 16 iterations i=0..15 are divided among four
threads; an implicit barrier follows the construct.]
OpenMP loop worksharing construct

OpenMP’s loop worksharing construct splits the loop iterations
among all active threads.

  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<N; i++) {
      foo(i);
    }
  }

Loop construct name:
• C/C++: for
• Fortran: do

The variable i is made “private” to each thread by default.
You could do this explicitly with a “private(i)” clause.
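
For reference, a complete compilable version of this pattern might
look like the sketch below (the array, its size N and the printout
are illustrative choices, not part of the slide; build with
something like gcc -fopenmp):

  #include <stdio.h>
  #include <omp.h>

  #define N 16

  int main(void) {
      int a[N];
      int i;

      #pragma omp parallel
      {
          /* The for construct splits the N iterations among the
             threads; i is private to each thread by default. */
          #pragma omp for
          for (i = 0; i < N; i++) {
              a[i] = omp_get_thread_num();
          }
      }   /* implicit barrier at the end of the parallel region */

      for (i = 0; i < N; i++)
          printf("iteration %d was run by thread %d\n", i, a[i]);
      return 0;
  }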
Worksharing – a motivating example

Sequential code:

  for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (SPMD):

  #pragma omp parallel
  {
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    if (id == Nthrds-1) iend = N;
    for (i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
  }

OpenMP parallel region and a worksharing for construct:

  #pragma omp parallel
  #pragma omp for
  for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }

By default, there is a barrier at the end of the parallel loop.
Use the nowait clause to turn off the barrier.
Combined OpenMP constructs

OpenMP allows a combined parallel and worksharing
directive on the same line.

  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<MAX; i++) {
      res[i] = huge();
    }
  }

  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }

These two forms are equivalent.

However, for performance reasons one should aim at making
the parallel regions as large as possible. Why?
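
One consequence of this advice, sketched below: two consecutive
loops can share a single parallel region instead of forking and
joining the thread team twice (the arrays and their values are
illustrative choices, not part of the slide):

  #include <stdio.h>

  #define N 1000

  int main(void) {
      static double a[N], b[N], c[N];

      for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }

      /* One parallel region enclosing two worksharing loops: the
         thread team is created once, not once per loop. */
      #pragma omp parallel
      {
          #pragma omp for
          for (int i = 0; i < N; i++)
              a[i] = a[i] + b[i];

          /* implicit barrier here: a[] is complete before it is read */
          #pragma omp for
          for (int i = 0; i < N; i++)
              c[i] = 2.0 * a[i];
      }

      printf("c[N-1] = %f\n", c[N-1]);
      return 0;
  }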
Working with loops

Basic approach:
• Find compute-intensive loops (use a profiler!)
• Make the loop iterations independent, so they can safely
execute in any order without loop-carried dependencies
• Place the appropriate OpenMP directive and test
Where is the bug?

  int i, j, A[MAX];
  j = 5;
  for (i=0; i<MAX; i++) {
    j += 2;
    A[i] = big(j);
  }

The loop-carried dependence on j is removed by computing j
directly from i, which makes the iterations independent:

  int i, A[MAX];
  #pragma omp parallel for
  for (i=0; i<MAX; i++) {
    int j = 5 + 2*(i+1);
    A[i] = big(j);
  }
OpenMP collapse clause

For perfectly nested rectangular loops we can parallelize
multiple loops in the nest with the collapse clause:

  #pragma omp parallel for collapse(2)
  for (int i1=0; i1<n1; i1++) {
    for (int i2=0; i2<n2; i2++) {
      .....
    }
  }

Conceptually, this corresponds to a single hand-written loop over
a combined index, with i1 and i2 recovered from it:

  i1=0; i2=0;
  for (int i1i2=0; i1i2<n1*n2; i1i2++) {
    .....
    i2++;
    if (i2 == n2) {
      i2 = 0;
      i1++;
    }
  }

• The collapsed loop forms a single iteration space of n1*n2
iterations, which is then split among the threads.
Reductions

An OpenMP reduction is used to generate code for recurrence
calculations (with associative and commutative operators) so
that they can be performed in parallel.
• Very common in numerical methods, e.g. computing
averages, norms or finding min/max

  double avg=0.0;
  double A[MAX];
  for (int i=0; i<MAX; i++) {
    avg += A[i];
  }
  avg /= MAX;

• Supported in most parallel programming environments
Reduction in OpenMP

OpenMP reduction clause:

  reduction (op : list)

Inside a parallel or a worksharing construct:
• A local copy of each variable in the list is created and
initialized depending on the “op” (e.g. 0 for “+”).
• Updates are made to the local copy.
• Local copies are reduced into a single value and combined
with the original global value.

The variables in “list” must be shared in the enclosing
parallel region.

  double avg=0.0;
  double A[MAX];
  #pragma omp parallel for reduction(+:avg)
  for (int i=0; i<MAX; i++) {
    avg += A[i];
  }
  avg /= MAX;
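
Since min/max were mentioned as common reduction targets, here is a
sketch using the max reduction operator (available in OpenMP 3.1 and
later); the array and its fill values are made up for illustration:

  #include <stdio.h>

  #define MAX 1000

  int main(void) {
      double A[MAX];
      for (int i = 0; i < MAX; i++)
          A[i] = (double)(i % 37);          /* arbitrary test data */

      double maxval = A[0];
      /* Each thread keeps a private maximum; the private maxima are
         combined into maxval when the loop finishes. */
      #pragma omp parallel for reduction(max:maxval)
      for (int i = 0; i < MAX; i++) {
          if (A[i] > maxval) maxval = A[i];
      }

      printf("max = %f\n", maxval);
      return 0;
  }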
Exercise 2.3

• Go back to the serial pi.c program and parallelize it with a
loop construct
• Your goal is to minimize the number of changes made to
the serial program. For instance, you can use the reduction
clause (one possible shape of the solution is sketched below).
• How does it perform?
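
A hedged sketch of what the parallel version could look like,
assuming pi.c is the usual numerical-integration program (the step
count is an arbitrary choice and the variable names may differ from
the course's file):

  #include <stdio.h>

  static long num_steps = 100000000;

  int main(void) {
      double step = 1.0 / (double)num_steps;
      double sum = 0.0;

      /* The only change to the serial code is this pragma: the loop
         construct shares the iterations and the reduction combines
         the per-thread partial sums. */
      #pragma omp parallel for reduction(+:sum)
      for (long i = 0; i < num_steps; i++) {
          double x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
      }

      printf("pi = %.15f\n", step * sum);
      return 0;
  }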
OpenMP Loop Scheduling

The OpenMP runtime decides how the loop iterations are
distributed among the threads; this is called scheduling.
OpenMP defines the following choices of loop scheduling:
• Static – predetermined at compile time. Lowest overhead,
predictable
• Dynamic – assignment is made at runtime
• Guided – a special case of dynamic; attempts to reduce
overhead
• Auto – the runtime can “learn” from previous
executions of the same loop
The schedule clause

The schedule clause affects how loop iterations are
mapped onto threads:
• schedule(static [,chunk])
– Deal out blocks of iterations of size “chunk” to each thread.
• schedule(dynamic [,chunk])
– Each thread grabs “chunk” iterations off a queue until all iterations
have been handled.
• schedule(guided [,chunk])
– Threads dynamically grab blocks of iterations. The size of the block
starts large and shrinks down to size “chunk” as the calculation
proceeds.
• schedule(runtime)
– Schedule and chunk size are taken from the OMP_SCHEDULE
environment variable (or from the runtime library in OpenMP 3.0).
• schedule(auto)
– The schedule is up to the runtime to choose (it does not have to be
any of the above).
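
To make the syntax concrete, a small self-contained sketch follows;
the loop bound, the chunk sizes and the artificial work() routine
are illustrative choices, not part of the slide:

  #include <stdio.h>
  #include <unistd.h>   /* usleep */

  #define N 100

  /* Placeholder workload whose cost grows with i, so the different
     schedules behave differently. */
  static void work(int i) { usleep(100 * i); }

  int main(void) {
      /* Fixed chunks of 8 iterations, dealt out round-robin. */
      #pragma omp parallel for schedule(static, 8)
      for (int i = 0; i < N; i++) work(i);

      /* Threads grab chunks of 4 iterations from a shared queue. */
      #pragma omp parallel for schedule(dynamic, 4)
      for (int i = 0; i < N; i++) work(i);

      /* Schedule chosen at run time from OMP_SCHEDULE. */
      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < N; i++) work(i);

      printf("done\n");
      return 0;
  }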
The schedule clause

Simple example to demonstrate the different strategies.

schedule.f90:

  program schedule
    use omp_lib
    use, intrinsic :: iso_c_binding
    integer :: i
    integer, parameter :: n = 1000
    integer :: buffer(n)

    interface
       subroutine usleep(u) bind(c)
         use, intrinsic :: iso_c_binding
         integer(kind=c_long), value :: u
       end subroutine usleep
    end interface

    !$omp parallel do schedule(runtime)
    do i = 1,n
       buffer(i) = omp_get_thread_num()
       call usleep(int(rand(buffer(i))*2000, kind=c_long))
    end do
    !$omp end parallel do

    open(1, file="schedule.dat")
    do i = 1, n
       write(1, *) i, buffer(i)
    end do
    close(1)
  end program schedule

schedule.gp:

  set terminal png
  set autoscale
  set output "schedule.png"
  set xlabel "Iteration"
  set ylabel "Thread ID"
  unset key
  plot "schedule.dat" using 1:2 ls 4

Build and run:

  > gfortran -fopenmp schedule.f90
  > setenv OMP_SCHEDULE static
  > ./a.out
  > gnuplot schedule.gp

Now check schedule.png.
The schedule clause

[Figure: schedule.png for three runs, showing thread ID vs.
iteration under static, dynamic and guided scheduling.]
Why different scheduling algorithms

Static scheduling works well in most situations, with threads
running almost in sync and where the cost of each iteration is
more or less constant.

Other scheduling algorithms are mainly used to address issues
when iteration costs vary or threads run out of sync.

Consider the example below: what will happen if static
scheduling is used?

  #pragma omp single nowait
  {
    serial_work()
  }
  parallel_work()

  serial_work()
  {
    Do some work
  }

  parallel_work()
  {
    #pragma omp for nowait schedule()
    {
      …
    }
    #pragma omp for schedule()
    {
    }
    #pragma omp for schedule()
    {
    }
  }
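
A compilable miniature of that pattern is sketched below; the
serial_work routine, the sleep times and the array are invented for
illustration. With schedule(static) the thread busy in serial_work
leaves its pre-assigned chunk waiting, while schedule(dynamic) lets
the other threads absorb those iterations:

  #include <stdio.h>
  #include <unistd.h>   /* usleep */
  #include <omp.h>

  #define N 64

  static void serial_work(void) {
      usleep(200000);   /* pretend 200 ms of setup done by one thread */
  }

  int main(void) {
      int owner[N];

      #pragma omp parallel
      {
          /* One thread runs the serial part; nowait lets the others
             continue straight into the loop below. */
          #pragma omp single nowait
          serial_work();

          /* Try schedule(static) here and compare the ownership map. */
          #pragma omp for schedule(dynamic)
          for (int i = 0; i < N; i++) {
              usleep(10000);                /* pretend 10 ms of work */
              owner[i] = omp_get_thread_num();
          }
      }

      for (int i = 0; i < N; i++)
          printf("i=%d thread=%d\n", i, owner[i]);
      return 0;
  }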
Exercise 3 – STREAM copy

STREAM is a widely used benchmark to measure the memory
bandwidth of a system. You will use it to study the effect of
multiple threads on memory bandwidth for a Beskow node.
• Using OpenMP in STREAM COPY

  #pragma omp parallel for
  for (j=0; j<STREAM_ARRAY_SIZE; j++)
    c[j] = a[j];

• Running STREAM

  export OMP_NUM_THREADS=4
  ./stream
Exercise 3 – Possible Issues

• How are threads in STREAM assigned to cores in the node?
• There are two processor chips in the node (Beskow).
• Each chip introduces a separate bandwidth limit.
• How are threads distributed across the cores?
• Are these measurements repeatable?
• The STREAM code makes no effort to get repeatable
results.
Exercise 3 – STREAM and loop scheduling

• STREAM as distributed uses the default (static) schedule
• Static is best when the loop limits are known, the work per
iteration is constant, and the cores are used only by the
application
• Question: Are all of those assumptions correct?
Exercise 3 – STREAM and loop scheduling

Question: Are all of those assumptions correct?
• The last one (cores used only by the application) is mostly
unrealistic

Try running STREAM with one thread per available core and
each of the schedules (a way to switch schedules without
recompiling is sketched below):
• Static
• Dynamic
• Guided
• How do they perform?
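
One option is to change the copy kernel to schedule(runtime) and
pick the schedule through OMP_SCHEDULE. The sketch below is a
self-contained miniature of the copy kernel, not the actual STREAM
source; the array size is an arbitrary stand-in and the bandwidth
formula follows STREAM's convention of one read plus one write per
element:

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  #define ARRAY_SIZE 10000000   /* stand-in for STREAM_ARRAY_SIZE */

  int main(void) {
      double *a = malloc(ARRAY_SIZE * sizeof *a);
      double *c = malloc(ARRAY_SIZE * sizeof *c);
      for (long j = 0; j < ARRAY_SIZE; j++) a[j] = 1.0;

      double t = omp_get_wtime();
      /* schedule(runtime) reads the schedule from OMP_SCHEDULE, so
         static, dynamic and guided can be compared without rebuilding. */
      #pragma omp parallel for schedule(runtime)
      for (long j = 0; j < ARRAY_SIZE; j++)
          c[j] = a[j];
      t = omp_get_wtime() - t;

      printf("copy: %.3f GB/s\n",
             2.0 * ARRAY_SIZE * sizeof(double) / t / 1e9);
      free(a); free(c);
      return 0;
  }

Set, for example, export OMP_SCHEDULE=guided (or static, dynamic)
before running, as with the Fortran demo earlier.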
Sections worksharing construct

The sections worksharing construct assigns a different block of
code to each thread.

  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      x_calculation();
      #pragma omp section
      y_calculation();
      #pragma omp section
      z_calculation();
    }
  }

By default, there is a barrier at the end of the omp sections.
Use the nowait clause to turn off the barrier.
Worksharing with tasks
New construct since OpenMP 3.0 to express task parallelism.

Useful for problems where the parallelism cannot be extracted
from loops, e.g. recursive or irregular algorithms (particle
tracking).

  double parallel_sum(const double *a, int n);

  double sum(const double *a, int n) {
      double res;
      #pragma omp parallel
      #pragma omp single nowait
      res = parallel_sum(a, n);
      return res;
  }

  double parallel_sum(const double *a, int n) {
      if (n <= 1)                     /* base case of the recursion */
          return (n == 1) ? a[0] : 0.0;
      int h = n / 2;
      double x, y;
      #pragma omp task shared(x)
      x = parallel_sum(a, h);
      #pragma omp task shared(y)
      y = parallel_sum(a + h, n - h);
      #pragma omp taskwait
      x += y;
      return x;
  }
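
A possible way to exercise the sketch above (the array length and
fill values are arbitrary; in practice a cutoff that switches to a
serial loop for small n would keep the task overhead down):

  #include <stdio.h>
  #include <stdlib.h>

  double sum(const double *a, int n);   /* from the sketch above */

  int main(void) {
      int n = 1 << 20;
      double *a = malloc(n * sizeof *a);
      for (int i = 0; i < n; i++)
          a[i] = 1.0;                    /* expected total: n */

      printf("sum = %f (expected %d)\n", sum(a, n), n);
      free(a);
      return 0;
  }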
