Lecture 12: Introduction to OpenMP (Part 1)
What is OpenMP?
Short version: Open specifications for MultiProcessing.
Long version: Open specifications for MultiProcessing via collaborative work between interested parties from the hardware and software industry, government and academia.
• An Application Program Interface (API) used to explicitly direct multi-threaded, shared-memory parallelism.
• API components:
  – Compiler directives
  – Runtime library routines
  – Environment variables
• Portability
  – API is specified for C/C++ and Fortran
  – Implementations exist on almost all platforms, including Unix/Linux and Windows
• Standardization
  – Jointly defined and endorsed by major computer hardware and software vendors
  – Possibility of becoming an ANSI standard
Brief History of OpenMP
Thread
• A process is an instance of a computer program that is being executed. It contains the program code and its current activity.
• A thread of execution is the smallest unit of processing that can be scheduled by an operating system.
• Differences between threads and processes:
  – A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory. The threads of a process share its instructions (code) and its context (the values that its variables reference at any given moment).
  – Different processes do not share these resources.
http://en.wikipedia.org/wiki/Process_(computing)
Process
• A process contains all the information needed to execute the program:
  – Process ID
  – Program code
  – Data on the run-time stack
  – Global data
  – Data on the heap
  Each process has its own address space.
• In multitasking, processes are given time slices in a round-robin fashion.
  – If computer resources are assigned to another process, the status of the present process has to be saved, so that execution of the suspended process can be resumed at a later time.
Threads
OpenMP Programming Model
OpenMP is not
– Necessarily implemented identically by all vendors
– Meant for distributed-memory parallel systems (it is designed for shared-address-space machines)
– Guaranteed to make the most efficient use of shared memory
– Required to check for data dependencies, data conflicts, race conditions, or deadlocks
– Required to check for non-conforming code sequences
– Meant to cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization
– Designed to guarantee that input or output to the same file is synchronous when executed in parallel
Fork-Join Parallelism
• An OpenMP program begins as a single process: the master thread. The master thread executes sequentially until the first parallel region construct is encountered.
• When a parallel region is encountered, the master thread
  – creates a group of threads (FORK);
  – becomes the master of this group of threads and is assigned thread id 0 within the group.
• The statements in the program that are enclosed by the parallel region construct are then executed in parallel among these threads.
• JOIN: when the threads complete executing the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.
OpenMP Code Structure
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main()
{
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    printf("Hello (%d)\n", ID);
    printf(" world (%d)\n", ID);
  }
}
Set the number of threads for OpenMP (in csh):
  setenv OMP_NUM_THREADS 8
Run: ./a.out
See: http://wiki.crc.nd.edu/wiki/index.php/OpenMP
• "Pragma" stands for "pragmatic information." A pragma is a way to communicate such information to the compiler.
• The information is non-essential in the sense that the compiler may ignore it and still produce a correct object program.
OpenMP Core Syntax
#include "omp.h"
int main()
{
  int var1, var2, var3;
  // Serial code
  ...
  // Beginning of parallel section.
  // Fork a team of threads. Specify variable scoping.
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    // Parallel section executed by all threads
    ...
  }
  // All threads join the master thread and disband
  // Resume serial code
  ...
}
OpenMP C/C++ Directive Format
OpenMP directive form:
– C/C++ uses compiler directives
  • Prefix: #pragma omp ...
– A directive consists of a directive name followed by clauses
Example: #pragma omp parallel default(shared) private(var1, var2)
OpenMP Directive Format (2)
General rules:
• Case sensitive
• Only one directive name may be specified per directive
• Each directive applies to at most one succeeding statement, which must be a structured block
• Long directive lines can be "continued" on succeeding lines by escaping the newline character with a backslash "\" at the end of a directive line
OpenMP parallel Region Directive
#pragma omp parallel [clause list]
Typical clauses in [clause list]:
• Conditional parallelization
  – if (scalar expression)
    • Determines whether the parallel construct creates threads
• Degree of concurrency
  – num_threads (integer expression)
    • Number of threads to create
• Data scoping
  – private (variable list)
    • Specifies variables local to each thread
  – firstprivate (variable list)
    • Similar to private
    • Private copies are initialized to the value the variable had just before the parallel directive
  – shared (variable list)
    • Specifies variables that are shared among all the threads
  – default (data scoping specifier)
    • Default data scoping specifier may be shared or none
Example:
#pragma omp parallel if (is_parallel == 1) num_threads(8) shared(var_b) private(var_a) firstprivate(var_c) default(none)
{
  /* structured block */
}
• if (is_parallel == 1) num_threads(8)
  – If the value of the variable is_parallel is one, create 8 threads
• shared (var_b)
  – Each thread shares a single copy of variable var_b
• private (var_a) firstprivate (var_c)
  – Each thread gets private copies of variables var_a and var_c
  – Each private copy of var_c is initialized with the value of var_c in the main thread when the parallel directive is encountered
• default (none)
  – The default state of a variable is specified as none (rather than shared)
  – Signals an error if not all variables are specified as shared or private
Number of Threads
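The thread count can be controlled in several ways; below is a minimal sketch combining the mechanisms that appear elsewhere in these notes (the value 4 is just an example):

#include <stdio.h>
#include "omp.h"

int main()
{
  omp_set_num_threads(4);              // runtime library call
  #pragma omp parallel num_threads(4)  // num_threads clause takes precedence over the call
  {
    if (omp_get_thread_num() == 0)
      printf("team size = %d\n", omp_get_num_threads());
  }
  return 0;
}

The OMP_NUM_THREADS environment variable (see the earlier csh example) provides a third, non-intrusive way to set the default team size.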
Thread Creation: Parallel Region Example
• Create threads with the parallel construct.
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main()
{
  int nthreads, tid;
  // num_threads clause requests 4 threads
  #pragma omp parallel num_threads(4) private(tid)
  {
    // Each thread executes a copy of the code within the structured block
    tid = omp_get_thread_num();
    printf("Hello world from (%d)\n", tid);
    if(tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("number of threads = %d\n", nthreads);
    }
  } // all threads join the master thread and terminate
}
Thread Creation: Parallel Region Example
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main(){
  int nthreads, A[100], tid;
  // fork a group of threads with each thread having a private tid variable
  omp_set_num_threads(4);
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    foo(tid, A);   // a single copy of A[] is shared among all threads
  } // all threads join the master thread and terminate
}
SPMD vs. Work-Sharing
Work-Sharing Construct
do/for
• Shares the iterations of a loop across the group
• Represents "data parallelism"
• The for directive partitions parallel iterations across threads; do is the analogous directive in Fortran
Usage:
#pragma omp for [clause list]
/* for loop */
• Implicit barrier at the end of the for loop
Example Using for
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main()
{
  int nthreads, tid;
  omp_set_num_threads(3);
  #pragma omp parallel private(tid)
  {
    int i;
    tid = omp_get_thread_num();
    printf("Hello world from (%d)\n", tid);
    #pragma omp for
    for(i = 0; i <= 4; i++)
    {
      printf("Iteration %d by %d\n", i, tid);
    }
  } // all threads join the master thread and terminate
}
Another Example Using for
• Sequential code to add two vectors:
for(i=0; i<N; i++) { c[i] = b[i] + a[i]; }
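A parallel version using the for work-sharing construct might look like the following sketch (assuming the arrays a, b, c and the size N are already defined and the iterations are independent):

int i;
#pragma omp parallel for shared(a, b, c) private(i)
for(i = 0; i < N; i++) { c[i] = b[i] + a[i]; }   // iterations are divided among the threads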
#include <stdlib.h>
int main()
{
  int b[3];
  char *cptr;
  int i;

  cptr = malloc(1);
  #pragma omp parallel for
  for(i=0; i<3; i++)
    b[i] = i;
}
(Figure: execution contexts of the master thread (0) and thread (1); b[0..2], cptr and the heap block live in the shared context, while each thread has its own copy of the loop index i on its run-time stack.)
Every thread has its own execution context: an address space containing all of the variables the thread may access. The execution context includes static variables, dynamically allocated data structures in the heap, and variables on the run-time stack; it also includes the thread's own additional run-time stack. A shared variable has the same address in the execution context of every thread, and all threads have access to shared variables. A private variable has a different address in the execution context of every thread.
Example: during parallel execution of the for loop, the index "i" is a private variable, while "b", "cptr" and the heap data are shared.
• Canonical shape of a "for" loop:
  for(index = start; index {<, <=, >=, >} end; increment)
  where the increment expression is one of: index++, ++index, index--, --index, index += inc, index -= inc
– The "for" loop must not contain statements that allow the loop to be exited prematurely.
  • Examples include the "break", "return", "exit" and "goto" statements.
– The "continue" statement is allowed.
C/C++ for Directive Syntax
#pragma omp for [clause list]
  schedule (type [, chunk])
  ordered
  private (variable list)
  firstprivate (variable list)
  lastprivate (variable list)
  shared (variable list)
  reduction (operator: variable list)
  collapse (n)
  nowait
/* for_loop */
Private Clause
• Directs the compiler to make one or more variables private.
#pragma omp parallel for private(j)
for(i = 0; i < M; i++)
  for(j = 0; j < N; j++)
    a[i][j] = min(a[i][j], a[i][k] + tmp[j]);
firstprivate Clause
• Serial version:
x[0] = 1.0;
for(i=0; i < n; i++){
  for(j=1; j<4; j++)
    x[j] = g(i, x[j-1]);
  answer[i] = x[1] - x[3];
}
• Parallel version. Each thread needs its own copy of x, but that copy must start out with the value x[0] = 1.0 assigned before the loop, so x is declared firstprivate:
x[0] = 1.0;
#pragma omp parallel for private(j) firstprivate(x)
for(i=0; i < n; i++){
  for(j=1; j<4; j++)
    x[j] = g(i, x[j-1]);
  answer[i] = x[1] - x[3];
}
lastprivate Clause
• Sequentially last iteration: the iteration that occurs last when the loop is executed sequentially.
• The lastprivate clause directs the compiler to generate code at the end of the parallel for loop that copies, back into the master thread's copy of a variable, the private copy of that variable from the thread that executed the sequentially last iteration of the loop.
for(i=0; i < n; i++){
  x[0] = 1.0;
  for(j=1; j<4; j++)
    x[j] = x[j-1]*(i+1);
  answer[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];
• In the sequentially last iteration of the loop, x[3] gets assigned the value n^3.
• To have this value accessible outside the parallel for loop, we declare x to be a lastprivate variable:
#pragma omp parallel for private(j) lastprivate(x)
for(i=0; i < n; i++){
  x[0] = 1.0;
  for(j=1; j<4; j++)
    x[j] = x[j-1]*(i+1);
  answer[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];
Reduction
• Serial code:
{
  double avg = 0.0, a[MAX];
  int i;
  ...
  for(i = 0; i < MAX; i++) { avg += a[i]; }
  avg /= MAX;
}
Reduction Clause
• reduction (operator: variable list): specifies how to combine the local copies of a variable in different threads into a single copy at the master when the threads exit. Variables in the variable list are implicitly private to the threads.
  – Operators: +, *, -, &, |, ^, &&, and ||
  – Usage:
#pragma omp parallel reduction(+: sums) num_threads(4)
{
  /* compute local sums in each thread */
}
/* sums here contains the sum of all local instances of sums */
Reduction in OpenMP for
• Inside a parallel or a work-sharing construct:
  – A local copy of each list variable is made and initialized depending on the operator (e.g., 0 for "+").
  – The compiler finds standard reduction expressions containing the operator and uses them to update the local copy.
  – The local copies are reduced into a single value and combined with the original global value when execution returns to the master thread.
{
  double avg = 0.0, a[MAX];
  int i;
  ...
  #pragma omp parallel for reduction(+:avg)
  for(i = 0; i < MAX; i++) { avg += a[i]; }
  avg /= MAX;
}
Reduction Operators/Initial-Values
C/C++:
  +    0
  *    1
  -    0
  &    ~0
  |    0
  ^    0
  &&   1
  ||   0
Monte Carlo to estimate PI
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main(int argc, char *argv[])
{
  ...
  pi = 4.0*count/samples;
  printf("Estimate of pi: %7.5f\n", pi);
}
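A minimal serial sketch of the estimator (assuming the number of samples comes from the command line and rand() is used to draw points in the unit square; this is not the slide's original code):

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int i, count = 0;
  int samples = atoi(argv[1]);
  double x, y, pi;
  srand(12345);                          // arbitrary seed
  for(i = 0; i < samples; i++){
    x = (double)rand()/RAND_MAX;         // random point in the unit square
    y = (double)rand()/RAND_MAX;
    if(x*x + y*y <= 1.0) count++;        // point falls inside the quarter circle
  }
  pi = 4.0*count/samples;
  printf("Estimate of pi: %7.5f\n", pi);
  return 0;
}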
OpenMP version of Monte Carlo to Estimate PI
#include <stdio.h>
#include <stdlib.h>
#include "omp.h"
...
samples = atoi(argv[1]);
...
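A sketch of how the sampling loop can be parallelized with a reduction on the counter (rand() is not guaranteed to be thread-safe, so this sketch uses rand_r() with a per-thread seed; that choice is an assumption, not the slide's original code):

#include <stdlib.h>
#include <stdio.h>
#include "omp.h"

int main(int argc, char *argv[])
{
  int i, count = 0;
  int samples = atoi(argv[1]);
  double pi;
  #pragma omp parallel
  {
    unsigned int seed = 12345 + omp_get_thread_num();   // per-thread seed
    #pragma omp for reduction(+:count)
    for(i = 0; i < samples; i++){
      double x = (double)rand_r(&seed)/RAND_MAX;
      double y = (double)rand_r(&seed)/RAND_MAX;
      if(x*x + y*y <= 1.0) count++;      // local counts are combined by the reduction
    }
  }
  pi = 4.0*count/samples;
  printf("Estimate of pi: %7.5f\n", pi);
  return 0;
}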
Static Scheduling
// static scheduling of matrix multiplication loops
#pragma omp parallel default(none) shared(a, b, c, dim) private(i, j, k) num_threads(4)
#pragma omp for schedule(static)
for(i=0; i < dim; i++)
{
  for(j=0; j < dim; j++)
  {
    c[i][j] = 0.0;
    for(k=0; k < dim; k++)
      c[i][j] += a[i][k]*b[k][j];
  }
}
• A static schedule maps iterations to threads at compile time.
Dynamic Scheduling
• The time needed to execute different loop iterations may vary considerably.
for(i=0; i<n; i++)
{
  for(j=i; j < n; j++)
    a[i][j] = rand();
}
• The first iteration of the outermost loop (i=0) requires n times more work than the last iteration (i=n-1). Inverting the two loops will not remedy the imbalance.
#pragma omp parallel default(none) shared(a, n) private(i, j) num_threads(4)
#pragma omp for schedule(dynamic)
for(i=0; i<n; i++)
{
  for(j=i; j < n; j++)
    a[i][j] = rand();
}
Environment Variables
By default, work-sharing for loops end with an implicit barrier.
• nowait: if specified, threads do not synchronize at the end of the parallel loop.
• ordered: specifies that the iterations of the loop must be executed in the order they would be in a serial program.
• collapse: specifies how many loops in a nested loop should be collapsed into one large iteration space and divided according to the schedule clause. The sequential execution of the iterations in all associated loops determines the order of the iterations in the collapsed iteration space.
Avoiding Synchronization with nowait
#pragma omp parallel shared(A, B, C) private(id)
{
  int i;
  id = omp_get_thread_num();
  A[id] = big_calc1(id);
  #pragma omp barrier                 // barrier: each thread waits until all threads arrive
  #pragma omp for
  for(i = 0; i < N; i++) { C[i] = big_calc3(i, A); }
  #pragma omp for nowait              // no implicit barrier due to nowait
  for(i = 0; i < N; i++) { B[i] = big_calc2(C, i); }
  A[id] = big_calc4(id);              // any thread can begin big_calc4() immediately,
                                      // without waiting for the others to finish the loop
}                                     // implicit barrier at the end of the parallel region
• By default, work-sharing for loops end with an implicit barrier.
• nowait clause:
  – Modifies a for directive
  – Avoids the implicit barrier at the end of the for
Loop Collapse
• Allows parallelization of perfectly nested loops without using nested parallelism.
• The compiler forms a single loop and then parallelizes it.
{
  ...
  #pragma omp parallel for collapse(2)
  for(i=0; i< N; i++)
  {
    for(j=0; j< M; j++)
    {
      foo(A, i, j);
    }
  }
}
For Directive Restrictions
Lecture 12: Introduction to OpenMP (Part 2)
Performance Issues I
• C/C++ stores matrices in row-major fashion.
• Loop interchange may increase cache locality.
{
  ...
  #pragma omp parallel for
  for(i=0; i< N; i++)
  {
    for(j=0; j< M; j++)
    {
      A[i][j] = B[i][j] + C[i][j];
    }
  }
}
Performance Issues II
• Move synchronization points outwards. In the code below the inner loop is parallelized:
• in each iteration step of the outer loop a parallel region is created, which causes parallelization overhead.
{
  ...
  for(i=0; i< N; i++)
  {
    #pragma omp parallel for
    for(j=0; j< M; j++)
    {
      A[i][j] = B[i][j] + C[i][j];
    }
  }
}
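One way to move the synchronization point outwards, as the slide suggests, is to create the parallel region once, outside the outer loop, and use the work-sharing for directive inside (a sketch; it assumes the iterations of the j loop are independent):

{
  ...
  #pragma omp parallel private(i)     // team is created only once
  for(i=0; i< N; i++)
  {
    #pragma omp for                   // only the work-sharing barrier remains per i-iteration
    for(j=0; j< M; j++)
    {
      A[i][j] = B[i][j] + C[i][j];
    }
  }
}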
Performance Issues III
C++: Random Access Iterator Loops
• Parallelization of random access iterator loops is supported.
void iterator_example(){
  std::vector<int> vec(23);
  std::vector<int>::iterator it;
  ...
}
Conditional Compilation
• Keep the sequential and parallel programs as a single source code.
#ifdef _OPENMP
#include "omp.h"
#endif
int main()
{
  #ifdef _OPENMP
  omp_set_num_threads(3);
  #endif
  for(i=0; i< N; i++)
  {
    #pragma omp parallel for
    for(j=0; j< M; j++)
    {
      A[i][j] = B[i][j] + C[i][j];
    }
  }
}
Be Careful with Data Dependences
• Whenever one statement in a program reads or writes a memory location, another statement reads or writes the same memory location, and at least one of the two statements writes the location, then there is a data dependence on that memory location between the two statements. Such a loop may not be executed in parallel.
for(i=1; i< N; i++)
{
  a[i] = a[i] + a[i-1];
}
• Anti-dependence: iteration i reads a[i+1], which a later iteration overwrites.
for(i=0; i< N-1; i++)
{
  x = b[i] + c[i];
  a[i] = a[i+1] + x;
}
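An anti-dependence of this kind can be removed by reading from a copy of the array made before the loop and by making the temporary x private (a sketch; a2 is a hypothetical copy of a created before the loop):

/* a2 is a copy of a taken before the loop (hypothetical helper array) */
#pragma omp parallel for private(x)
for(i=0; i< N-1; i++)
{
  x = b[i] + c[i];
  a[i] = a2[i+1] + x;
}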
• Original loop nest:
for(i=1; i< m; i++)
  for(j=0; j<n; j++)
  {
    a[i][j] = 2.0*a[i-1][j];
  }
• Parallelizing the inner loop gives poor performance: it requires m-1 fork/join steps.
for(i=1; i< m; i++)
  #pragma omp parallel for
  for(j=0; j<n; j++)
  {
    a[i][j] = 2.0*a[i-1][j];
  }
• Invert the loops to yield better performance(?). With this inversion, only a single fork/join step is needed; the data dependences have not changed. However, the change affects the cache hit rate.
#pragma omp parallel for private(i)
for(j=0; j< n; j++)
  for(i=1; i<m; i++)
  {
    a[i][j] = 2.0*a[i-1][j];
  }
• Flow dependence is in general difficult to remove. In this case the sum can be expressed as a reduction:
x = 0.0;
for(i=0; i< N; i++)
{
  x = x + a[i];
}
x = 0.0;
#pragma omp parallel for reduction(+:x)
for(i=0; i< N; i++)
{
  x = x + a[i];
}
• Elimination of induction variables (idx, isum and pow2 are each a function of the loop index i, so they can be computed directly in every iteration):
idx = N/2+1; isum = 0; pow2 = 1;
for(i=0; i< N/2; i++)
{
  a[i] = a[i] + a[idx];
  b[i] = isum;
  c[i] = pow2;
  idx++; isum += i; pow2 *= 2;
}
• Parallel version:
#pragma omp parallel for shared(a, b, c)
for(i=0; i< N/2; i++)
{
  a[i] = a[i] + a[i+N/2];
  b[i] = i*(i-1)/2;
  c[i] = pow(2, i);
}
• Remove flow dependence using loop skewing:
for(i=1; i< N; i++)
{
  b[i] = b[i] + a[i-1];
  a[i] = a[i] + c[i];
}
• Parallel version:
b[1] = b[1] + a[0];
#pragma omp parallel for shared(a, b, c)
for(i=1; i< N-1; i++)
{
  a[i] = a[i] + c[i];
  b[i+1] = b[i+1] + a[i];
}
a[N-1] = a[N-1] + c[N-1];
• A flow dependence that in general cannot be remedied is a recurrence:
for(i=1; i< N; i++)
{
  z[i] = z[i] + l[i]*z[i-1];
}
Recurrence: LU Factorization of Tridiagonal Matrix
• Factor T = LU. Then Tx = LUx = b; setting z = Ux, proceed in two steps: solve Lz = b, then solve Ux = z.
• Lz = b is solved by the forward recurrence:
z[0] = b[0];
for(i=1; i< n; i++)
{
  z[i] = b[i] - l[i]*z[i-1];
}
(Figure showing the entries of the tridiagonal matrix T and its LU factors is not reproduced here.)
Synchronization I
• Two threads complete the work, each with its own private task_ptr.
int main(int argc, char *argv[])
{
  struct job_struct job_ptr;
  struct task_struct *task_ptr;
  ...
  #pragma omp parallel private(task_ptr)
  {
    // the execution of the code block after the parallel pragma
    // is replicated among the threads
    task_ptr = get_next_task(&job_ptr);
    while(task_ptr != NULL){
      complete_task(task_ptr);
      task_ptr = get_next_task(&job_ptr);
    }
  }
  ...
}
{
  ...
  #pragma omp parallel shared(best_cost)
  {
    int i;
    #pragma omp for nowait
    for(i=0; i<N; i++){
      int my_cost;
      my_cost = estimate(i);
      // Only one thread at a time executes the if() statement. This ensures
      // mutual exclusion when accessing shared data. Without critical, this
      // would set up a race condition, in which the computation exhibits
      // nondeterministic behavior when performed by multiple threads
      // accessing a shared variable.
      #pragma omp critical
      {
        if(my_cost < best_cost)
          best_cost = my_cost;
      }
    }
  }
}
Synchronization: atomic
• atomic provides mutual exclusion, but it only applies to the load/update of a memory location.
• It is a lightweight, special form of a critical section.
• It is applied only to the (single) assignment statement that immediately follows it.
{
  ...
  #pragma omp parallel
  {
    double tmp, B;
    ...
    #pragma omp atomic
    X += tmp;          // atomic only protects the update of X
  }
}
"ic" is a counter. The atomic construct ensures that no updates are lost when multiple threads are updating the counter value, as sketched below.
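A minimal sketch (the variable name ic comes from the slide; everything else is assumed):

#include <stdio.h>
#include "omp.h"

int main()
{
  int ic = 0;                  // shared counter
  #pragma omp parallel
  {
    #pragma omp atomic
    ic++;                      // each thread increments the counter once; no update is lost
  }
  printf("ic = %d\n", ic);     // equals the number of threads in the team
  return 0;
}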
• The atomic construct may only be used together with an expression statement using one of the operations +, *, -, /, &, ^, |, <<, >> (or an increment/decrement of the form x++, ++x, x--, --x).
Synchronization: barrier
Suppose each of the following two loops is run in parallel over i; this may give a wrong answer, because an element a[i] may be read in the second loop before it has been written in the first. A barrier is needed between the two loops:
for(i= 0; i<N; i++)
  a[i] = b[i] + c[i];
#pragma omp barrier
for(i= 0; i<N; i++)
  d[i] = a[i] + b[i];
To avoid the race condition:
• NEED: all threads wait at the barrier point and only continue when all threads have reached the barrier point.
Barrier syntax:
• #pragma omp barrier
Synchronization: barrier
barrier: each thread waits until all threads arrive.
"master" Construct
• The "master" construct defines a structured block that is executed only by the master thread.
• The other threads skip the "master" construct; no synchronization is implied.
• It does not have an implied barrier on entry or exit.
• The lack of a barrier may lead to problems.
• Master construct to initialize the data, as in the sketch below.
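A minimal sketch (the array x, its size n, and the follow-up work are assumptions; note the explicit barrier, needed because master has no implied barrier):

#pragma omp parallel shared(x, n)
{
  #pragma omp master
  {
    int i;
    for(i = 0; i < n; i++) x[i] = 0.0;   // only the master thread initializes the shared data
  }
  #pragma omp barrier                    // other threads must not use x before it is initialized
  /* ... all threads now work on x ... */
}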
"single" Construct
• The "single" construct builds a block of code that is executed by only one thread (not necessarily the master thread).
• A barrier is implicitly set at the end of the single block (the barrier can be removed with the nowait clause).
#pragma omp parallel
{
  ...
  #pragma omp single
  {
    exchange_information();
  }
  do_other_things();
  ...
}
• Single construct to initialize a shared variable, as in the sketch below.
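A minimal sketch (read_problem_size() and do_work() are hypothetical helpers):

int n;
#pragma omp parallel shared(n)
{
  #pragma omp single
  {
    n = read_problem_size();   // executed by exactly one thread
  }
  // implicit barrier here: every thread sees the initialized value of n
  do_work(n);
}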
Synchronization: ordered
• The "ordered" region executes in the sequential order of the loop iterations.
#pragma omp parallel private(tmp)
{
  ...
  #pragma omp for ordered reduction(+:res)
  for(i=0; i<N; i++)
  {
    tmp = compute(i);
    #pragma omp ordered
    res += consume(tmp);
  }
  do_other_things();
  ...
}
Synchronization: Lock routines
• A lock implies a memory fence of all thread-visible variables.
• These routines are used to guarantee that only one thread accesses a variable at a time, to avoid race conditions.
• C/C++ lock variables must have type "omp_lock_t" or "omp_nest_lock_t".
• All lock functions require an argument that is a pointer to omp_lock_t or omp_nest_lock_t.
• Simple lock routines:
  – omp_init_lock(omp_lock_t*); omp_set_lock(omp_lock_t*); omp_unset_lock(omp_lock_t*); omp_test_lock(omp_lock_t*); omp_destroy_lock(omp_lock_t*);
http://gcc.gnu.org/onlinedocs/libgomp/index.html#Top
General Procedure to Use Locks
1. Define the lock variable.
2. Initialize the lock via a call to omp_init_lock.
3. Set the lock using omp_set_lock or omp_test_lock. The latter checks whether the lock is actually available before attempting to set it; it is useful for achieving asynchronous thread execution.
4. Unset the lock after the work is done via a call to omp_unset_lock.
5. Remove the lock association via a call to omp_destroy_lock.
Locking Example
omp_lock_t lck;
omp_init_lock(&lck);                 // initialize a lock associated with lock variable lck
                                     // for use in subsequent calls
#pragma omp parallel shared(lck) private(tmp, id)
{
  id = omp_get_thread_num();
  tmp = do_some_work(id);
  omp_set_lock(&lck);                // a thread waits here for its turn
  printf("%d %d\n", id, tmp);
  omp_unset_lock(&lck);              // release the lock so that the next thread gets a turn
}
omp_destroy_lock(&lck);
Runtime Library Routines
• Routines for modifying/checking the number of threads
  – omp_set_num_threads(int n);
  – int omp_get_num_threads(void);
  – int omp_get_thread_num(void);
  – int omp_get_max_threads(void);
• Test whether in an active parallel region
  – int omp_in_parallel(void);
• Allow the system to dynamically vary the number of threads from one parallel construct to another
  – omp_set_dynamic(int set)
    • set = true: enables dynamic adjustment of team sizes
    • set = false: disables dynamic adjustment
  – int omp_get_dynamic(void)
• Get the number of processors in the system
  – int omp_get_num_procs(void); returns the number of processors online
http://gcc.gnu.org/onlinedocs/libgomp/index.html#Top
Default Data Storage Attributes
• A shared variable has a single storage location in memory for the whole duration of the parallel construct. All threads that reference such a variable access the same memory; thus, reading/writing a shared variable provides an easy mechanism for communicating between threads.
  – In C/C++, by default, all program variables except the loop index become shared variables in a parallel region.
  – Global variables are shared among threads.
  – C: file-scope variables, static variables, dynamically allocated memory (by malloc() or by new).
• A private variable has multiple storage locations, one within the execution context of each thread.
  – Not shared variables:
    • Stack variables in functions called from parallel regions are private.
    • Automatic variables within a statement block are private.
  – This holds for pointers as well. Therefore, do not assign to a private pointer the address of a private variable of another thread; the result is not defined.
Changing Data Storage Attributes
Private Clause
• The "private (variable list)" clause creates a new local copy of each listed variable for each thread.
  – The values of these variables are not initialized on entry to the parallel region.
  – The values of the data specified in the private clause can no longer be accessed after the corresponding region terminates (the values are not defined on exit from the parallel region).
Lastprivate Clause
• The lastprivate clause passes the value of a private variable from the last iteration to a global variable.
  – It is supported on the work-sharing loop and sections constructs.
  – It ensures that the last value of a listed data object is accessible after the corresponding construct has completed execution.
  – When used with a work-sharing loop, the object has the value from the iteration of the loop that would be last in a "sequential" execution.
Default Clause
• C/C++ only has default(shared) or default(none).
• Only Fortran supports default(private).
• The default data attribute is default(shared).
  – Exception: #pragma omp task
• default(none): no default attribute for variables in the static extent; the storage attribute must be listed for each variable in the static extent. Good programming practice.
Lexical (static) and Dynamic Extent I
Lexical and Dynamic Extent II
void print_thread_id();       // called from the parallel region: part of its dynamic extent

int main(){
  #pragma omp parallel
  {                           // static (lexical) extent of the parallel region
    print_thread_id();
  }
}

void print_thread_id()        // dynamic extent
{
  int id = omp_get_thread_num();
  printf("Hello world from thread %d\n", id);
}
R. Hartman-Baker, Using OpenMP.
Threadprivate
• threadprivate makes global data private to a thread.
  – C/C++: file-scope and static variables, static class members.
  – Each thread gets its own set of the global variables, with initial values undefined.
• Different from private:
  – With the private clause, global variables are masked.
  – threadprivate preserves global scope within each thread.
  – Parallel regions must be executed by the same number of threads for the global data to persist.
• Threadprivate variables can be initialized using the copyin clause or at the time of definition.
If all of the conditions below hold, and if a threadprivate object is referenced in two consecutive (at run time) parallel regions, then threads with the same thread number in their respective regions reference the same copy of that variable:
– Neither parallel region is nested inside another parallel region.
– The number of threads used to execute both parallel regions is the same.
#include <stdio.h>
#include <stdlib.h>
#include "omp.h"

// The threadprivate directive is used to give each thread a
// private copy of the global pointer pglobal.
int *pglobal;
#pragma omp threadprivate(pglobal)

int main(){
  ...
  #pragma omp parallel for private(i,j,sum,TID) shared(n,length,check)
  for (i=0; i<n; i++)
  {
    TID = omp_get_thread_num();
    if((pglobal = (int*) malloc(length[i]*sizeof(int))) != NULL) {
      for(j=sum=0; j < length[i]; j++) pglobal[j] = j+1;
      sum = calculate_sum(length[i]);
      printf("TID %d: value of sum for i = %d is %d\n", TID, i, sum);
      free(pglobal);
    } else {
      printf("TID %d: not enough memory: length[%d] = %d\n", TID, i, length[i]);
    }
  }
}
/* source file of function calculate_sum() */
extern int *pglobal;
...
• Each thread has its own copy of sum0, updated in a parallel region that is entered several times. The value of sum0 from one execution of the parallel region is still available when the region is next started, as in the sketch below.
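A minimal sketch of that behavior (the function name, array, and size are assumptions):

static int sum0 = 0;               // file-scope variable: one persistent copy per thread
#pragma omp threadprivate(sum0)

void add_chunk(int *a, int n)      // hypothetical: called several times from the serial part
{
  #pragma omp parallel
  {
    int i;
    #pragma omp for
    for(i = 0; i < n; i++)
      sum0 += a[i];                // accumulates into this thread's persistent copy
  }                                // sum0 keeps its per-thread value for the next call
}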
Copyin Clause
• The copyin clause copies the master thread's threadprivate variables into the corresponding threadprivate variables of the other threads.
int global[100];
#pragma omp threadprivate(global)
int main(){
  for(int i= 0; i<100; i++) global[i] = i+2; // initialize data
  #pragma omp parallel copyin(global)
  {
    // parallel region: each thread gets a copy of global, with the initialized values
  }
}
Copyprivate Clause
• The copyprivate clause is supported on the single directive to broadcast the values of private variables from one thread of a team to the other threads in the team.
  – The typical usage is to have one thread read or initialize private data that is subsequently used by the other threads as well.
  – After the single construct has ended, but before the threads have left the associated barrier, the values of the variables specified in the associated list are copied to the other threads.
  – Do not use copyprivate in combination with the nowait clause.
#include "omp.h"
void input_parameters(int*, int*); // fetch values of input parameters
int main(){
  int Nsize, choice;
  #pragma omp parallel private(Nsize, choice)
  {
    #pragma omp single copyprivate(Nsize, choice)
    input_parameters(&Nsize, &choice);
    do_work(Nsize, choice);
  }
}
Flush Directive
• OpenMP supports a shared memory model.
  – However, processors can have their own "local" high-speed memory: the registers and cache.
  – If a thread updates shared data, the new value will first be saved in a register and then stored back to the local cache.
  – The updates are thus not necessarily immediately visible to other threads.
Why Task Parallelism?
#include "omp.h"
/* traverse elements in the list */
...
• Poor performance
• Improved performance with sections, but:
  • Too many parallel regions
  • Extra synchronization
  • Not flexible
#include "omp.h"
/* traverse elements in the list */
...
OpenMP 3.0 and Tasks
Tasks make it possible to parallelize irregular problems:
– Unbounded loops
– Recursive algorithms
– Manager/worker schemes
– ...
A task has
– Code to execute
– A data environment (it owns its data)
– Internal control variables
– An assigned thread that executes the code and the data
Two activities: packaging and execution
– Each encountering thread packages a new instance of a task (code and data)
– Some thread in the team executes the task at some later time
• OpenMP has always had tasks, but they were not called "tasks".
  – A thread encountering a parallel construct packages up a set of implicit tasks, one per thread.
  – A team of threads is created.
  – Each thread is assigned to one of the tasks.
  – A barrier holds the master thread until all implicit tasks are finished.
• OpenMP 3.0 adds a way to create a task explicitly for the team to execute.
Task Directive
#pragma omp task [clauses]
  if (logical expression)
  untied
  shared (list)
  private (list)
  firstprivate (list)
  default (shared | none)
structured block
#include "omp.h"
/* traverse elements in the list */
...
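A sketch of the kind of task-based list traversal these slides build up to (the List type with a next field and the process() routine are assumptions; later slides call this routine traverse_list(L)):

void traverse_list(List *p)
{
  while(p != NULL){
    #pragma omp task firstprivate(p)
    process(p);               // each list element becomes one task
    p = p->next;
  }
  #pragma omp taskwait        // wait until all tasks created above have finished
}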
/* Tree traversal using tasks */
struct node{
  struct node *left, *right;
};
void traverse(struct node *p, int postorder){
  if(p->left != NULL)
    #pragma omp task
    traverse(p->left, postorder);
  if(p->right != NULL)
    #pragma omp task
    traverse(p->right, postorder);
  if(postorder){
    #pragma omp taskwait   // in postorder mode, wait for the child tasks before processing p
  }
  process(p);
}
Task Data Scope
int a;
void foo(){
  int b, c;
  #pragma omp parallel shared(c)
  {
    int d;
    #pragma omp task
    {
      int e;
      /*
        a = shared
        b = firstprivate
        c = shared
        d = firstprivate
        e = private
      */
    }
  }
}
Task Synchronization
• #pragma omp taskwait: the current task waits until all child tasks it has generated have completed (as in the tree-traversal example above).
Task Execution Model
Multiple traversals of the same list (every thread in the team runs the traversal):
#include "omp.h"
/* traverse elements in the list */
List *L;
...
#pragma omp parallel
traverse_list(L);

Single traversal:
• One thread enters single and creates all tasks
• All the team cooperates in executing them
#include "omp.h"
/* traverse elements in the list */
List *L;
...
#pragma omp parallel
#pragma omp single
traverse_list(L);
#include "omp.h"
/* traverse elements in the list */
List L[N];
...
#pragma omp parallel for
for (i = 0; i < N; i++)
  traverse_list(&L[i]);

Multiple traversals (of different lists):
• Multiple threads create tasks
• All the team cooperates in executing them
Hybrid MPI/OpenMP
• Vector mode: MPI is called only outside OpenMP parallel regions.
(Figure: (a) C+MPI: processes P connected by an interconnection network; (b) C+MPI+OpenMP: each MPI process P runs multiple OpenMP threads t, and the processes communicate over the interconnection network.)
Basic Hybrid Framework
...
#pragma omp master
{
  if(0 == my_rank){
    // some MPI call as the root process
  } else {
    // some MPI call as a non-root process
  }
} // end of omp master
Concept 2: Master OpenMP Thread Controls Communication
• Each MPI process uses its own OpenMP master thread to communicate.
• More care is needed to ensure efficient communication.
...
#pragma omp master
{
  // some MPI call as an MPI process
} // end of omp master
Concept 3: All OpenMP Threads May Use MPI Calls
• This is by far the most flexible communication scheme.
• Great care must be taken to account explicitly for which thread of which MPI process communicates.
• Requires an addressing scheme that denotes which MPI process participates in the communication and which thread of that MPI process is involved, e.g., <my_rank, omp_thread_id>.
• Neither MPI nor OpenMP has built-in facilities for tracking communication.
• Critical sections may be used for some level of control.
...
#pragma omp critical
{
  // some MPI call as an MPI process
} // end of omp critical
Conjugate Gradient
• Algorithm
  – Start with the MPI program
  – MPI_Send/Recv for communication
  – OpenMP "for" directive for the matrix-vector multiplication
#include <stdlib.h>
#include <stdio.h>
#include "MyMPI.h"
int main(int argc, char *argv[]){
  double **a, *astorage, *b, *x;
  int p, id, m, n, nl;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  read_block_row_matrix(id, p, argv[1], (void*)(&a), (void*)(&astorage), MPI_DOUBLE, &m, &n);
  nl = read_replicated_vector(id, p, argv[2], (void**)(&b), MPI_DOUBLE);
  if((m != n) || (n != nl)) {
    printf("Incompatible dimensions %d %d times %d\n", m, n, nl);
  }
  else{
    x = (double*)malloc(n*sizeof(double));
    cg(p, id, a, b, x, n);
    print_replicated_vector(id, p, x, MPI_DOUBLE, n);
  }
  MPI_Finalize();
}
#define EPSILON 1.0e-10
double *piece;
void cg(int p, int id, double **a, double *b, double *x, int n){
  int i, it;
  double *d, *g, denom1, denom2, num1, num2, s, *tmpvec;
  d = (double*)malloc(n*sizeof(double));
  g = (double*)malloc(n*sizeof(double));
  tmpvec = (double*)malloc(n*sizeof(double));
  piece = (double*)malloc(BLOCK_SIZE(id,p,n)*sizeof(double));
  for(i=0; i<n; i++){
    d[i] = x[i] = 0.0;
    g[i] = -b[i];
  }
  for(it=0; it<n; it++){
    denom1 = dot_product(g, g, n);
    matrix_vector_product(id, p, n, a, x, g);
    for(i=0; i<n; i++) g[i] -= b[i];
    num1 = dot_product(g, g, n);
    if(num1 < EPSILON) break;
    for(i=0; i<n; i++) d[i] = -g[i] + (num1/denom1)*d[i];
    num2 = dot_product(d, g, n);
    matrix_vector_product(id, p, n, a, d, tmpvec);
    denom2 = dot_product(d, tmpvec, n);
    s = -num2/denom2;
    for(i=0; i<n; i++) x[i] += s*d[i];
  }
}
double dot_product(double *a, double *b, int n)
{
  int i;
  double answer = 0.0;
  for(i=0; i<n; i++)
    answer += a[i]*b[i];
  return answer;
}
void matrix_vector_product(int id, int p, int n, double **a, double *b, double *c){
  int i, j;
  double tmp;
  #pragma omp parallel for private(i, j, tmp)
  for(i=0; i<BLOCK_SIZE(id,p,n); i++){
    tmp = 0.0;
    for(j=0; j<n; j++)
      tmp += a[i][j]*b[j];
    piece[i] = tmp;
  }
  new_replicate_block_vector(id, p, piece, n, c, MPI_DOUBLE);
}
void new_replicate_block_vector(int id, int p, double *piece, int n, double *c, MPI_Datatype dtype)
{
  int *cnt, *disp;
  create_mixed_xfer_arrays(id, p, n, &cnt, &disp);
  MPI_Allgatherv(piece, cnt[id], dtype, c, cnt, disp, dtype, MPI_COMM_WORLD);
}
Steady-State Heat Distribution
• Use row decomposition.
int find_steady_state(int p, int id, int my_rows, double **u, double **w)
{
  double diff, global_diff, tdiff;
  int its, i, j;
  MPI_Status status;
  its = 0;
  for(;;) {
    if(id > 0) MPI_Send(u[1], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD);
    if(id < p-1) {
      MPI_Send(u[my_rows-2], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD);
      MPI_Recv(u[my_rows-1], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD, &status);
    }
    if(id > 0) MPI_Recv(u[0], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD, &status);
    diff = 0.0;
    #pragma omp parallel private(i, j, tdiff)
    {
      tdiff = 0.0;
      #pragma omp for
      for(i=1; i<my_rows-1; i++)
        for(j=1; j<N-1; j++){
          w[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])/4.0;
          if(fabs(w[i][j]-u[i][j]) > tdiff) tdiff = fabs(w[i][j]-u[i][j]);
        }
      #pragma omp for nowait
      for(i=1; i<my_rows-1; i++)
        for(j=1; j<N-1; j++)
          u[i][j] = w[i][j];
      #pragma omp critical
      if(tdiff > diff) diff = tdiff;
    }
    MPI_Allreduce(&diff, &global_diff, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    if(global_diff <= EPSILON) break;
    its++;
  }
  return its;
}
OpenMP multithreading in MPI
• MPI-2 specification
  – Does not mandate thread support
  – Does define what a "thread compliant MPI" should do
  – 4 levels of thread support:
    • MPI_THREAD_SINGLE: there is no OpenMP multithreading in the program.
    • MPI_THREAD_FUNNELED: all of the MPI calls are made by the master thread. This will happen if all MPI calls are outside OpenMP parallel regions or are in master regions. A thread can determine whether it is the master thread by calling MPI_Is_thread_main.
• MPI_THREAD_SERIALIZED: multiple threads make MPI calls, but only one at a time.
• MPI_THREAD_MULTIPLE: multiple threads may make MPI calls at any time, with no restrictions.
• Threaded MPI initialization: instead of starting MPI with MPI_Init, use
int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
  required: the desired level of thread support.
  provided: the actual level of thread support provided by the system.
Thread support at levels MPI_THREAD_FUNNELED or higher allows potential overlap of communication and computation.
http://www.mpi-forum.org/docs/mpi-20-html/node165.htm
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include "omp.h"
...
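A minimal hybrid "hello world" sketch built from these pieces (the original slide's program body is not reproduced; this version assumes MPI_THREAD_FUNNELED support is sufficient):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include "omp.h"

int main(int argc, char *argv[])
{
  int provided, rank;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  if(provided < MPI_THREAD_FUNNELED){
    fprintf(stderr, "Insufficient thread support\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  #pragma omp parallel
  {
    // every OpenMP thread prints; MPI itself is called only from the master thread
    printf("Hello from thread %d of MPI process %d\n", omp_get_thread_num(), rank);
  }
  MPI_Finalize();
  return 0;
}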
References:
– http://bisqwit.iki.fi/story/howto/openmp/
– http://openmp.org/mp-documents/omp-hands-on-SC08.pdf
– https://computing.llnl.gov/tutorials/openMP/
– http://www.mosaic.ethz.ch/education/Lectures/hpc
– R. van der Pas. An Overview of OpenMP.
– B. Chapman, G. Jost and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, Cambridge, Massachusetts / London, England.
– B. Estrade. Hybrid Programming with MPI and OpenMP.