3unit3 Mca Pecnotes
I. OpenMP: Introduction
OpenMP stands for Open Multi-Processing (Open Specifications for Multi-Processing).
An Application Programming Interface (API) for developing parallel programs for
shared memory architectures
Three primary components of the API are
Compiler Directives
Runtime Library Routines
Environment Variables
Standard specifies C, C++, and FORTRAN Directives & API
Provides a platform-independent set of compiler pragmas, directives, function calls,
and environment variables that explicitly instruct the compiler how and where to use
parallelism in the application
Master thread executes in serial mode until the parallel region construct is
encountered
When a thread reaches a PARALLEL directive, it creates a team of threads and
becomes the master of the team. The master is a member of that team and has
thread number 0 within that team.
After executing the statements in the parallel region, team threads synchronize and
terminate (join) but master continues.
OpenMP Memory Model: Basic Terms
OpenMP threads have a place for storing and retrieving data that is available to all threads,
called the memory (i.e., all threads share an address space).
In addition, each thread is allowed to have its own temporary view of the memory
The temporary view of memory allows the thread to cache variables and thereby to
avoid going to memory for every reference to a variable.
Each thread also has access to another type of memory that must not be
accessed by other threads, called threadprivate memory.
Data can move between memory and a thread's temporary view, but can never
move between temporary views directly, without going through memory.
A memory model is defined in terms of two aspects of memory system behavior relating to
shared memory parallel programs: coherence and consistency.
Coherence: Behavior of the memory system when a single address(i.e., memory
location) is accessed by multiple threads.
Consistency: Ordering of accesses to different addresses (memory locations) by multiple
threads, observable from the various threads in the system.
For coherence
OpenMP doesn't specify any coherence behavior of the memory system. It is left
to the base language and computer system.
OpenMP does not guarantee anything about the result of memory operations that
constitute data races within a program
For consistency
OpenMP does guarantee certain consistency behavior, however. That behavior is
based on the OpenMP flush operation
Each variable used within a parallel region is either shared or private.
Each variable referenced in the parallel structured block has an original variable,
which is the variable by the same name that exists in the program immediately
outside the construct.
Each reference to a shared variable in the structured block becomes a reference to
the original Variable (i.e., has same address in execution context of every thread)
For each private variable referenced in the structured block, a new version of the
original variable (of the same type and size) is created in memory for each task (i.e.,
it has a different address in the execution context of every thread).
A thread cannot access the private variable of another thread
Creation of the new version does not alter the value of the original variable.
However, the impact of attempts to access the original variable during the region
associated with the directive is unspecified.
References to a private variable in the structured block refer to the current task’s
private version of the original variable
3.1 Flush Directive/Operation
The memory model has relaxed-consistency because a thread’s temporary view of memory
is not required to be consistent with memory at all times.
The OpenMP standard specifies that all modifications are written back to main
memory and are thus available to all threads at synchronization points in the
program.
Between these synchronization points threads are permitted to have new values
for shared variables stored in their local memory rather than in the global shared
memory.
Each thread executing an OpenMP code potentially has its own temporary view of
the values of shared data. This approach is called a relaxed consistency model:
a value written to a variable can remain in the thread's temporary view until it is
forced to memory at a later time;
a read from a variable may retrieve the value from the thread's temporary view,
unless it is forced to read from memory.
Sometimes, however, updated values of shared variables must become visible to other
threads in between synchronization points. This is done with the flush directive.
The flush operation is applied to a set of variables called the flush-set.
If no list is provided, it applies to all thread-visible shared data.
If the flush operation is invoked by a thread that has updated the variables, their
new values will be flushed to memory and therefore be accessible to all other
threads
If the construct is invoked by a thread that has not updated a value, it will ensure
that any local copies of the data are replaced by the latest value from main memory
Implicit flush operations with no list occur at the following locations:
All explicit and implicit barriers(e.g., at the end of a parallel region or
work-sharing constructs)
Entry to and exit from critical region
Entry to and exit from lock routines
Clauses accepted by the parallel directive include copyin(list) and
reduction({operator|intrinsic_procedure_name}:list), among others.
When a thread encounters a parallel construct, a team of threads is created to
execute the parallel region.
The thread that encountered the parallel construct becomes the master thread of the
new team, with a thread number of zero for the duration of the new parallel region.
All threads in the new team, including the master thread, execute the region.
Once the team is created, the number of threads in the team remains constant for
the duration of that parallel region.
Within a parallel region, thread numbers uniquely identify each thread.
Thread numbers are consecutive whole numbers ranging from zero for the master
thread up to one less than the number of threads in the team.
A thread may obtain its own thread number by a call to the omp_get_thread_num
library routine.
If a thread in a team executing a parallel region encounters another parallel
directive, it creates a new team and becomes the master of that new team.
Each thread has a unique integer “id”; master thread has “id” 0, and other threads
have “id” 1, 2, …
OpenMP runtime function omp_get_thread_num() returns a thread’s unique “id”.
The function omp_get_num_threads() returns the total number of executing threads
The function omp_set_num_threads(x) asks for "x" threads to execute in the next
parallel region (must be called outside a parallel region).
omp parallel for directive is the combination of the omp parallel and omp for
directives.
This directive defines a parallel region containing a single for directive in one
step.
parallel for pragma splits the iterations of a loop across multiple threads in the
parallel region
5.2.1 Scheduling Loops
Intel C++ and Fortran compilers support all four scheduling schemes (static, dynamic,
guided, and runtime).
Syntax: schedule(kind[, chunk_size]), where kind is one of static, dynamic, guided, or runtime.
Dynamic scheduling is most useful when the different iterations of the loop may take
different times to execute.
For example, if the chunk size is specified as 16 with the schedule(dynamic,16)
clause and the total number of iterations is 100, the partition is
[16,16,16,16,16,16,4].
E.g., with schedule(runtime), the schedule is chosen at run time from the
OMP_SCHEDULE environment variable:
export OMP_SCHEDULE=dynamic,16
5.2.2 Reductions
Loops that reduce a collection of values to a single value are said to perform a
reduction operation.
Example for reduction operation
sum = 0;
for ( k = 0; k < 100; k++ ) {
    sum = sum + func(k); // func has no side-effects
}
Effective use of Reductions
To perform such an accumulation safely in parallel, OpenMP provides the reduction clause:
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++) {
    sum = sum + func(k);
}
For each variable specified in a reduction clause, a private copy is created for
each thread.
The private copy is then initialized to the initialization value for the operator.
For the above program the compiler creates private copies of the variable sum
for each thread, and when the loop completes, it adds the values together and
places the result in the original variable.
5.2.3 Clauses
5.2.3.1 OpenMP: Private Clause
private(variables-list)
int B;
B = 10;
#pragma omp parallel private(B)
{
    B = …; /* each thread works on its own uninitialized copy of B */
}
firstprivate(variables-list)
The variables are scoped as private within the parallel region.
All variables in the list are initialized with the value the original object had before
entering the parallel construct.
lastprivate(variables-list)
The value of B from the thread that executed the sequentially last iteration (i = n-1) is
copied to the master thread's serial copy of B.
5.2.3.4 OpenMP: Threadprivate Clause
threadprivate(variables-list)
Makes the listed global-scope (or static) variables private to each thread, with the
values persisting across parallel regions.
5.2.3.5 OpenMP: For Directive Caveats (warnings/requirements/limitations)
OpenMP will only parallelize for loops; it won't parallelize while loops or do-while
loops.
OpenMP parallelizes for loops that are in canonical form, where the number of
iterations can be determined from the for statement itself prior to execution of
the loop.
The conditions in the definition of canonical for loops are sufficient for deciding
the number of iterations prior to execution.
Loops that are NOT in canonical form cannot be parallelized. Examples:
i) An "infinite loop":
for ( ; ; ) { ... }
ii) A loop containing break: it cannot be parallelized, since the number of iterations
can't be determined from the for statement alone. This for loop is also not a
structured block, since the break adds another point of exit from the loop.
for (i = 0; i < n; i++)
{
if ( . . . ) break;
...
}
iii) Data dependences:
A task K is said to be data-dependent on task J if K needs data generated by J. E.g.,
J: foo = bar + 3;
K: lama = foo + 3;
Note that this dependence is ordered: the dependent task K can neither run
before J nor in parallel with J.
The above statements cannot be parallelized.
iv) Loop-carried dependences (can sometimes be avoided using strip-mining):
A loop-carried dependence exists when a statement S1 that references a location L in
one iteration executes before a statement S2 that references the same location L in a
later iteration.
E.g., a program which contains loop carried dependence
//it will fail due to loop-carried dependencies.
x[0] = 0;
y[0] = 1;
#pragma omp parallel for private(k)
for ( k = 1; k < 100; k++ ) {
x[k] = y[k-1] + 1; // S1
y[k] = x[k-1] + 2; // S2
}
A write to location x[k] is performed in iteration k, and a read of the same location
occurs in iteration k+1 (as x[k-1]).
Similarly, the value read as y[k-1] in iteration k was written in iteration k-1 (as y[k]).
The above loop cannot be parallelized.
5.3 OpenMP: Sections Directive
The sections directive is a non-iterative work-sharing construct.
It specifies that the enclosed section(s) of code are to be divided among the
threads in the team.
Independent SECTION directives are nested within a SECTIONS directive. Each
SECTION is executed once by a thread in the team
Syntax:
#pragma omp sections [clause ...]
{
    #pragma omp section
        structured_block
    #pragma omp section
        structured_block
}
The sections are divided among the threads; each section is executed exactly once.
If the program contains more sections than threads, the remaining sections are
assigned as threads finish their previous sections.
No schedule clause is defined for sections; how, when, and in what order threads are
scheduled to execute the sections is decided by the OpenMP implementation.
Restrictions:
It is illegal to branch into or out of section blocks.
5.4 OpenMP: Single Directive
specifies that the enclosed code is to be executed by only one thread in the team.
useful for sections of code that are not thread-safe (such as I/O)
Syntax: #pragma omp single [clause ...] structured_block
Threads in the team that do not execute the SINGLE directive, wait at the end of the
enclosed code block, unless a NOWAIT/nowait clause is specified.
5.5 Synchronization Directive
OpenMP has the following constructs to support synchronization
1. Barrier
2. Single
3. critical section
4. Master
5.5.1 Barrier
synchronizes all threads in the team (Or) Each thread waits until all threads
arrive
All threads in a team (or none) must execute the BARRIER region.
Syntax: #pragma omp barrier
5.5.2. Single
specifies that the enclosed code is to be executed by only one thread in the team (see 5.4).
useful for sections of code that are not thread-safe (such as I/O)
Syntax: #pragma omp single [clause ...] structured_block
Threads in the team that do not execute the SINGLE directive wait at the end of
the enclosed code block, unless a NOWAIT/nowait clause is specified.
5.5.3. Critical Section
specifies a region of code that must be executed by only one thread at a time
(i.e., Only one thread at a time can enter a critical section)
If a thread is currently executing inside a CRITICAL region and another thread
reaches that CRITICAL region and attempts to execute it, it will block until the
first thread exits that CRITICAL region
Syntax: #pragma omp critical [(name)] structured_block
E.g.,
float res;
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < niters; i++)
    {
        float B = big_job(i);
        #pragma omp critical
        consum(B, res);   /* only one thread at a time updates res */
    }
}
5.5.4. MASTER Directive
Specifies a region that is to be executed only by the master thread of the team. All
other threads on the team skip this section of code
Syntax : #pragma omp master {structured_block}
Example
#pragma omp parallel
{
    #pragma omp for nowait
    for (k = 0; k < 100; k++)
        x[k] = f1(tid);
    #pragma omp master
    y = f();                /* only the master thread calls this */
    #pragma omp barrier     /* all threads wait until y is ready */
    #pragma omp for nowait  /* again, this loop is divided among the threads */
    for (k = 0; k < 100; k++)
        x[k] = y + f2(x[k]);
    #pragma omp single
    fn_single_print(y);     /* exactly one thread prints y */
    #pragma omp master
    fn_print_array(x);      /* only the master thread prints x[] */
}
OpenMP environment variables:
OMP_NUM_THREADS
    Sets the default number of threads to use during execution, unless that number is
    explicitly changed by calling the omp_set_num_threads library routine or by an
    explicit num_threads clause on a parallel directive.
    Default: number of processors
    Syntax: export OMP_NUM_THREADS=value[,value]*
OMP_DYNAMIC
    Specifies whether the OpenMP run time can adjust the number of threads in a
    parallel region.
    Default: FALSE
    Syntax: export OMP_DYNAMIC=value
    Example: export OMP_DYNAMIC=true
#include <omp.h>
These functions affect and monitor threads, processors, and the parallel environment:
1. omp_set_num_threads
2. omp_get_num_threads
3. omp_get_max_threads
4. omp_get_thread_num
5. omp_get_num_procs
6. omp_in_parallel
7. omp_set_dynamic
8. omp_get_dynamic
1. omp_set_num_threads
sets the default number of threads to use for subsequent parallel regions that do
not specify a num_threads clause
Syntax: void omp_set_num_threads(int num_threads);
2. omp_get_num_threads
returns the number of threads in the team executing the parallel region from which
it is called.
When called from a serial region, this function returns 1.
By default, the value returned by this function is equal to the value of the
environment variable OMP_NUM_THREADS or to the value set by the last previous
call to the omp_set_num_threads() function
Syntax :
int omp_get_num_threads(void);
3. omp_get_max_threads
returns the maximum number of threads that would form the team if a parallel region
without a num_threads clause were encountered at this point in the code.
Syntax
int omp_get_max_threads(void);
4. omp_get_thread_num
returns the ID of a thread.
The thread number lies between 0 and omp_get_num_threads()–1.
The thread with the ID of 0 is the master thread
Syntax:
int omp_get_thread_num(void);
Example
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
    int nthreads, tid, maxt;
    omp_set_num_threads(3);            /* tell OpenMP to use 3 threads in parallel regions */
    #pragma omp parallel private(tid)  /* start parallel region */
    {
        tid = omp_get_thread_num();
        if (tid == 0)                  /* only the master thread does this */
        {
            printf("Thread %d getting environment info...\n", tid);
            nthreads = omp_get_num_threads();
            maxt = omp_get_max_threads();
            printf("Number of threads = %d\n", nthreads);
            printf("Max threads = %d\n", maxt);
        }
    }
    return 0;
}
Output:
Thread 0 getting environment info...
Number of threads = 3
Max threads = 3
5. omp_get_num_procs
returns the number of processors that are available to the program at the time the
function is called.
Syntax: int omp_get_num_procs(void);
6. omp_in_parallel
called to determine whether the section of code that is executing is parallel or not.
returns a nonzero value if called from within an active parallel region, and 0 otherwise.
Syntax: int omp_in_parallel(void);
7. omp_set_dynamic
enables or disables dynamic adjustment of the number of threads available for
execution of parallel regions.
Syntax: void omp_set_dynamic(int dynamic_threads);
dynamic_threads = nonzero value => the number of threads that are used for
executing subsequent parallel regions may be adjusted automatically by the run-
time environment to best utilize system resources.
dynamic_threads = 0, dynamic adjustment is disabled.
8. omp_get_dynamic
returns a nonzero value if dynamic adjustment of threads is enabled, and returns
0 otherwise
Syntax : int omp_get_dynamic(void);
Example
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
    int tid, procs, inpar, dynamic;
    #pragma omp parallel private(tid)  /* start parallel region */
    {
        tid = omp_get_thread_num();    /* obtain thread number */
        if (tid == 0)                  /* only master thread does this */
        {
            printf("Thread %d getting environment info...\n", tid);
            /* get environment information */
            procs = omp_get_num_procs();
            inpar = omp_in_parallel();
            dynamic = omp_get_dynamic();
            printf("Number of processors = %d\n", procs);
            printf("In parallel? = %d\n", inpar);
            printf("Dynamic threads enabled? = %d\n", dynamic);
        }
    }
    return 0;
}
Output:
Thread 0 getting environment info...
Number of processors = 3
In parallel? = 1
Dynamic threads enabled? = 0
VIII. OpenMP : Handling Data and Functional Parallelism
In the code below, alpha and beta can be computed in parallel, and then gamma and
delta can be computed in parallel (gamma needs the results of alpha and beta).
The code is written using the sections construct to achieve functional parallelism.
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section /* this pragma optional */
        v = alpha();
        #pragma omp section
        w = beta();
    }
    #pragma omp sections
    {
        #pragma omp section /* this pragma optional */
        x = gamma(v, w);
        #pragma omp section
        y = delta();
    }
}
printf("%6.2f\n", epsilon(x, y));
for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
Two rows may not be updated simultaneously, because there are data dependences
between rows, but the columns may be updated simultaneously.
That is, the loop indexed by j may be executed in parallel, but not the loop indexed by i.
Inserting a parallel for pragma before the inner loop produces a program that
executes correctly, but it requires m-1 fork/join steps, one per iteration of the
outer loop.
However, if we invert the loops, only a single fork/join step is required: the data
dependences have not changed, and the iterations of the loop indexed by j are still
independent of each other.
10.3 Scheduling loops (refer 5.2.1)
Note : Refer Scheduling concept in for constructs
10.4 Conditionally Executing Loops
If a loop does not have enough iterations, the time spent forking and joining threads
may exceed the time saved by dividing the loop iterations among multiple threads.
For example:
area = 0.0;
#pragma omp parallel for reduction(+:area)
for (i = 0; i < n; i++)
    area += i;
avg = area / n;
The if clause gives us the ability to direct the compiler to insert code that determines
at run-time whether the loop should be executed in parallel or sequentially.
The clause has this syntax: if (scalar_expression)
If the scalar expression evaluates to true, the loop will be executed in parallel.
Otherwise, it will be executed serially.
Adding an if clause to the parallel for pragma in the parallel program computing the area:
#pragma omp parallel for reduction(+:area) if(n > 5000)
for (i = 0; i < n; i++)
    area += i;
avg = area / n;
The loop iterations above will be divided among multiple threads only if n > 5000.