
UNIT III SHARED MEMORY PROGRAMMING WITH OpenMP

OpenMP Execution Model – Memory Model – OpenMP Directives – Work-sharing Constructs – Library Functions – Handling Data and Functional Parallelism – Handling Loops – Performance Considerations.

I. OpenMP: Introduction
 OpenMP stands for Open Multi-Processing (Open specifications for Multi-Processing)
 An Application Programming Interface (API) for developing parallel programs for
shared memory architectures
 Three primary components of the API are
 Compiler Directives
 Runtime Library Routines
 Environment Variables
 Standard specifies C, C++, and FORTRAN Directives & API
 Provides a platform-independent set of compiler pragmas, directives, function calls,
and environment variables that explicitly instruct the compiler how and where to use
parallelism in the application

1.1 OpenMP: Components

II. OpenMP : Execution Model/Fork - Join Parallelism


 The OpenMP API is intended to support programs that will execute correctly both as parallel programs (multiple threads of execution and a full OpenMP support library) and as sequential programs (directives ignored and a simple OpenMP stubs library).
 The OpenMP API uses the fork-join model of parallel execution
 Team := Master + Workers
 Programs begin as a single process: the master thread, which always has thread ID 0

 Master thread executes in serial mode until the parallel region construct is
encountered
 When a thread reaches a PARALLEL directive, it creates a team of threads and
becomes the master of the team. The master is a member of that team and has
thread number 0 within that team.
 After executing the statements in the parallel region, team threads synchronize and
terminate (join) but master continues.
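
A minimal sketch of the fork-join model (this program is illustrative, not taken from the original text); each thread in the team prints its ID, and the master continues alone after the implicit join:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Master thread runs alone (serial part)\n");

    #pragma omp parallel                    /* fork: a team of threads is created */
    {
        int tid = omp_get_thread_num();     /* 0 for the master thread */
        printf("Hello from thread %d of %d\n", tid, omp_get_num_threads());
    }                                       /* join: team threads synchronize and terminate */

    printf("Master thread continues alone\n");
    return 0;
}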

III. OpenMP : Memory Model


 OpenMP provides a relaxed-consistency, shared-memory model.

OpenMP Memory Model: Basic Terms

 OpenMP threads have a place for storing and retrieving data that is available to all threads, called the memory (i.e., all threads share an address space)
 In addition, each thread is allowed to have its own temporary view of the memory
 The temporary view of memory allows the thread to cache variables and thereby to
avoid going to memory for every reference to a variable.
 Each thread also has access to another type of memory that must not be
accessed by other threads, called threadprivate memory.
 Data can move between memory and a thread's temporary view, but can never
move between temporary views directly, without going through memory.
 A memory model is defined in terms of two aspects of memory-system behavior relating to shared-memory parallel programs: coherence and consistency
Coherence: the behavior of the memory system when a single address (i.e., memory location) is accessed by multiple threads.
Consistency: the ordering of accesses to different addresses (memory locations) by multiple threads, as observable from the various threads in the system.
For coherence
 OpenMP doesn't specify any coherence behavior of the memory system. It is left
to the base language and computer system.
 OpenMP does not guarantee anything about the result of memory operations that
constitute data races within a program
For consistency
 OpenMP does guarantee certain consistency behavior, however. That behavior is
based on the OpenMP flush operation
 Each variable used within a parallel region is either shared or private.

 Each variable referenced in the parallel structured block has an original variable,
which is the variable by the same name that exists in the program immediately
outside the construct.
 Each reference to a shared variable in the structured block becomes a reference to the original variable (i.e., it has the same address in the execution context of every thread)
 For each private variable referenced in the structured block, a new version of the original variable (of the same type and size) is created in memory for each task (i.e., it has a different address in the execution context of every thread)
 A thread cannot access the private variable of another thread
 Creation of the new version does not alter the value of the original variable.
 However, the impact of attempts to access the original variable during the region
associated with the directive is unspecified.
 References to a private variable in the structured block refer to the current task’s
private version of the original variable
3.1 Flush directive/ Operation
 The memory model has relaxed-consistency because a thread’s temporary view of memory
is not required to be consistent with memory at all times.

 OpenMp standard specifies that all modifications are written back to main
memory and are thus available to all threads at synchronization points in the
program.
 Between these synchronization points threads are permitted to have new values
for shared variables stored in their local memory rather than in the global shared
memory.
 Each thread executing an OpenMP code potentially has its own temporary view of the values of shared data. This approach is called a relaxed-consistency model:
 value written to a variable can remain in the thread’s temporary view until it is
forced to memory at a later time.
 a read from a variable may retrieve the value from the thread’s temporary view,
unless it is forced to read from memory.
 But sometimes updated values of shared variables must become visible to other threads in between synchronization points. This is done by the flush directive.


 The OpenMP flush operation enforces consistency between the temporary view and
memory.
Syntax

#pragma omp flush[(list)]

 The flush operation is applied to a set of variables called the flush-set.
 If no list is provided, it applies to all thread-visible shared data.
 If the flush operation is invoked by a thread that has updated the variables, their
new values will be flushed to memory and therefore be accessible to all other
threads
 If the construct is invoked by a thread that has not updated a value, it will ensure
that any local copies of the data are replaced by the latest value from main memory
 Implicit flush operations with no list occur at the following locations:
 All explicit and implicit barriers (e.g., at the end of a parallel region or work-sharing construct)
 Entry to and exit from critical regions
 Entry to and exit from lock routines
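
A possible sketch of handing a value from one thread to another with explicit flushes (variable names are illustrative; in practice the implicit flushes of barriers or critical regions are usually sufficient):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {      /* producer thread */
            data = 42;
            #pragma omp flush(data)           /* make data visible before the flag */
            flag = 1;
            #pragma omp flush(flag)
        } else {                              /* consumer thread */
            int ready = 0;
            while (!ready) {                  /* spin until the flag becomes visible */
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush(data)           /* re-read memory to get the up-to-date data */
            printf("Received %d\n", data);
        }
    }
    return 0;
}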

IV. OpenMP : Directives


 OpenMP directives for C/C++ are specified with the pragma preprocessing
directive
 C/C++:

#pragma omp directive-name [clause [clause] ...] new-line

 The ‘#pragma’ directive is the method specified by the C standard for providing additional information to the compiler; ‘pragma’ stands for pragmatic information.
 Directives are case sensitive. For continuation, use \ at the end of the line of the pragma.
 Every thread executes the same statements which are inside the parallel
region simultaneously.
 At the end of the parallel region there is an implicit barrier for
synchronization .
4.1 Parallel constructs
Syntax

#pragma omp parallel [clause [clause] ...] new-line
structured-block

E.g.,: #pragma omp parallel
{ code that executes in parallel }
where clause is one of the following
 if(scalar-logical-expression)
 num_threads(scalar-integer-expression)
 default(private | firstprivate | shared | none)
 private(list)
 firstprivate(list)
 shared(list)

 copyin(list)
 reduction({operator|intrinsic_procedure_name}:list)
 When a thread encounters a parallel construct, a team of threads is created to
execute the parallel region.
 The thread that encountered the parallel construct becomes the master thread of the
new team, with a thread number of zero for the duration of the new parallel region.
 All threads in the new team, including the master thread, execute the region.
 Once the team is created, the number of threads in the team remains constant for
the duration of that parallel region.
 Within a parallel region, thread numbers uniquely identify each thread.
 Thread numbers are consecutive whole numbers ranging from zero for the master
thread up to one less than the number of threads in the team.
 A thread may obtain its own thread number by a call to the omp_get_thread_num
library routine.
 If a thread in a team executing a parallel region encounters another parallel directive, it creates a new team and becomes the master of that new team (see the sketch after this list).

 Each thread has a unique integer “id”; master thread has “id” 0, and other threads
have “id” 1, 2, …

 OpenMP runtime function omp_get_thread_num() returns a thread’s unique “id”.
 The function omp_get_num_threads() returns the total number of executing threads
 The function omp_set_num_threads(x) asks for “x” threads to execute in the next
parallel region (must be set outside region)
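
A possible sketch of a nested parallel region, as mentioned above (the thread counts actually obtained depend on the implementation and on whether nested parallelism is enabled, e.g. via OMP_NESTED):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                        /* allow nested parallel regions */

    #pragma omp parallel num_threads(2)       /* outer team of 2 threads */
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)   /* each outer thread becomes master of a new team */
        printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
    }
    return 0;
}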

V. OpenMP : Work-Sharing Constructs


 A work-sharing construct divides the execution of the enclosed code region among
the members of the team (i.e., they split the work)
 Work-sharing constructs do not launch new threads
 There is no barrier upon entry to a work-sharing construct, but, there is an implied
barrier at the end of a work sharing construct
Restrictions
 A work-sharing construct must be enclosed dynamically within a parallel region in
order for the directive to execute in parallel
 Work-sharing constructs must be encountered by all members of a team or none at
all
5.1 Types of Work-Sharing Constructs
 FOR - shares iterations of a loop across the team. Represents a type of “data
parallelism”.
 SECTIONS - breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of "functional parallelism".
 SINGLE - defines a section of code where exactly one thread is allowed to execute the code; threads not chosen to execute this section ignore the code.

5.2 OpenMP : FOR Directive


Syntax #pragma omp for [clause[[,] clause] ... ]

The clause can be


 private( variable-list )
 firstprivate( variable-list )
 lastprivate( variable-list )
 reduction( operator : variable-list)
 ordered
 schedule(kind[, chunk_size])
 nowait

 omp parallel for directive is the combination of the omp parallel and omp for
directives.
 This directive defines a parallel region containing a single for directive in one step
 parallel for pragma splits the iterations of a loop across multiple threads in the
parallel region
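
A minimal sketch of the combined directive (the array a and its length N are assumptions made for illustration):

#include <omp.h>
#define N 1000

void scale(double a[N])
{
    int i;
    #pragma omp parallel for       /* fork a team and split the iterations among its threads */
    for (i = 0; i < N; i++)
        a[i] = 2.0 * a[i];         /* iterations are independent, so any split is safe */
}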
5.2.1 OpenMP : FOR Directive Schedule Clause


 schedule clause specifies how iterations of the for loop are divided among
threads of the team
 OpenMP offers four scheduling schemes (schedule kinds): static, dynamic, guided, and runtime (described below)
 Intel C++ and Fortran compilers support all four scheduling schemes
Syntax

schedule(kind[, chunk_size])

5.2.1.1 Static-even scheduling


 This is the default scheduling technique
 Loop iterations are divided into pieces of size chunk and then statically assigned
to threads
 In the absence of a chunk size, iterations are divided evenly (if possible) and contiguously among the threads.
 Pre-determined and predictable by the programmer
 Least work at runtime: scheduling is done at compile time
 E.g., if there are m iterations and N threads in the team, each thread gets about m/N iterations
 Example: with 4 threads and schedule(static) on a loop of 100 iterations, thread 0 gets iterations 0-24, thread 1 gets 25-49, thread 2 gets 50-74, and thread 3 gets 75-99
5.2.1.2 Dynamic scheduling


 Fixed portions of work; the size is controlled by the value of chunk
 Each time, a thread grabs a number of iterations equal to the chunk size specified in the schedule clause, except possibly for the last chunk.
 When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
 The chunks are handled with a first-come, first-served scheme
 Most work at runtime: complex scheduling logic used at run-time

 Most useful, when the different iterations in the loop may take different time to
execute.
 For example, if the chunk size is specified as 16 with the schedule(dynamic,16) clause and the total number of iterations is 100, the partition would be [16,16,16,16,16,16,4].
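
A small sketch of dynamic scheduling with the chunk size from the example above (work() is a hypothetical function whose cost varies from iteration to iteration):

#include <omp.h>

void process(double result[100], double (*work)(int))
{
    int k;
    /* iterations are handed out in chunks of 16 on a first-come, first-served basis */
    #pragma omp parallel for schedule(dynamic, 16)
    for (k = 0; k < 100; k++)
        result[k] = work(k);
}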
5.2.1.3 Guided Scheduling


 Special case of dynamic to reduce scheduling overhead
 The size of the block starts large and shrinks down to size chunk as the calculation
proceeds
 Default chunk size is 1
 The partitioning of a loop is done based on the following formula, with start value β0 = number of loop iterations:

πk = ⌈βk / (2N)⌉, but never smaller than the chunk size S (the final chunk simply contains the iterations that remain)

where N is the number of threads
πk denotes the size of the kth chunk
βk is the number of remaining unscheduled loop iterations (βk+1 = βk − πk)
S represents the chunk size
 Example: for a loop of size β0 = 800, N = 2, and S = 80, the loop partition is
[200,150,113,85,80,80,80,12]

5.2.1.4 Runtime Scheduling


 Runtime is not itself a scheduling scheme.
 When runtime is specified in the schedule clause, the iteration scheduling scheme is set at run time through the environment variable OMP_SCHEDULE
 The format of the OMP_SCHEDULE environment variable is schedule-type[,chunk-size]
Example :

export OMP_SCHEDULE=dynamic,16
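
A small sketch of deferring the scheduling decision to run time (the array and loop body are placeholders); with the export above, this loop behaves as if schedule(dynamic,16) had been written:

#include <omp.h>

void square_all(double a[], int n)
{
    int i;
    /* kind and chunk size are read from OMP_SCHEDULE when the loop starts */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        a[i] = a[i] * a[i];
}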
5.2.2 Reductions
 Loops that reduce a collection of values to a single value perform a reduction operation
 Example of a reduction operation:
sum = 0;
for ( k = 0; k < 100; k++ ){
sum = sum + func(k); // func has no side-effects
}
Effective use of Reductions
 To perform such updates safely in parallel, OpenMP provides the reduction clause

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++) {
sum = sum + func(k);
}

 For each variable specified in a reduction clause, a private copy is created for
each thread.
 The private copy is then initialized to the initialization value for the operator.
 For the above program the compiler creates private copies of the variable sum
for each thread, and when the loop completes, it adds the values together and
places the result in the original variable.
5.2.3 Clauses
5.2.3.1 OpenMP : Private Clause

private(variables-list)

 All references are to the local object .


 Private variables are undefined on entry and exit of the parallel region
 The value of the original variable (before the parallel region) is undefined after
the parallel region
 A private variable within a parallel region has no storage association with the
same variable outside of the region
 A private uninitialized copy of B is created before the parallel region begins
 B is not the same within the parallel region as outside
 Use the first/last private clause to override this behaviour

int B;
B = 10;
#pragma omp parallel private(B)
{
B = ...;   /* each thread has its own uninitialized copy of B */
}

5.2.3.2 OpenMP : FirstPrivate Clause

firstprivate(variables-list)
 The variables are scoped as private within the parallel region .
 All variables in the list are initialized with the value the original object had before
entering the parallel construct.

 A private initialized copy of t is created before the parallel region begins


 The copy of each thread gets the same value
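
A possible sketch of the behaviour described above, assuming t is an ordinary variable assigned before the region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int t = 5;                               /* value set before the parallel region */

    #pragma omp parallel firstprivate(t)
    {
        /* every thread starts with its own copy of t, initialized to 5 */
        t += omp_get_thread_num();
        printf("thread %d: t = %d\n", omp_get_thread_num(), t);
    }
    /* the original t still holds 5 here */
    return 0;
}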
5.2.3.3 OpenMP : LastPrivate Clause

lastprivate(variables-list)

 The variables are scoped as private within the parallel region


 Writes back to the master’s copy the value contained in the private copy belonging to the
thread that executed the sequential last iteration of the loop.

 The value of B in the thread that executed the sequentially last iteration(i=n-1) is
copied to the master thread’s serial copy of B
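
A possible sketch, assuming B is assigned in every iteration of a loop of n iterations:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, n = 100, B = 0;

    #pragma omp parallel for lastprivate(B)
    for (i = 0; i < n; i++)
        B = i * i;               /* each thread updates its own private copy of B */

    /* B now holds the value from the sequentially last iteration (i = n-1) */
    printf("B = %d\n", B);       /* prints 9801 */
    return 0;
}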
5.2.3.4 OpenMP: Threadprivate Clause

threadprivate(variables-list)

 Each thread gets an initialized private copy of B


 The value of B is retained between parallel regions
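
A possible sketch, assuming B is a file-scope variable (threadprivate applies to global/static data; copyin can be used to initialize each thread's copy from the master):

#include <stdio.h>
#include <omp.h>

int B;                               /* file-scope variable */
#pragma omp threadprivate(B)         /* each thread keeps its own persistent copy */

int main(void)
{
    B = 10;
    #pragma omp parallel copyin(B)   /* every thread's copy starts at 10 */
    B += omp_get_thread_num();

    #pragma omp parallel             /* the copies are retained from the previous region */
    printf("thread %d: B = %d\n", omp_get_thread_num(), B);

    return 0;
}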

5.2.3.5 OpenMP: For Directive Caveats (warnings/requirements/limitations)
 OpenMP will only parallelize for loops. It won’t parallelize while loops or do while
loops
 OpenMP parallelizes for loops that are in canonical form where the number of
iterations can be determined from the for statement itself prior to execution of
the loop.
 Conditions in the definition of canonical for loops are sufficient for deciding on the
number of iteration prior to execution.
 Loops in canonical form look like for (index = start; index < end; index++), with the usual variations of the test and increment expressions, so that the trip count can be computed before the loop runs.
 Further caveats:
i) an "infinite loop" cannot be parallelized:
for ( ; ; ) { ...}
ii) This loop cannot be parallelized, since the number of iterations can't be determined from the for statement alone. This for loop is also not a structured block, since the break adds another point of exit from the loop.
for (i = 0; i < n; i++)
{
if ( . . . ) break;
...
}
iii) No Data dependences
A task K is said to be data-dependent on task J if K needs data generated by J. Eg.,
J: foo = bar + 3;
K: lama = foo + 3;
 Note that this dependency is ordered: the dependent task k can neither run
before j nor in parallel with j.
 The above statements cannot be parallelized
iv) No loop-carried dependency (can be avoided using strip-mining)
 Data dependences between statement instances that belong to different loop iterations are called loop-carried
 Statements S1 and S2 must refer to the same memory location L, and
 the execution of S1 that references L must occur before the execution of S2 that references L, in different loop iterations
 E.g., a program which contains a loop-carried dependence (it will fail if parallelized, because of the loop-carried dependencies):
x[0] = 0;
y[0] = 1;
#pragma omp parallel for private(k)
for ( k = 1; k < 100; k++ ) {
x[k] = y[k-1] + 1; // S1
y[k] = x[k-1] + 2; // S2
}
 A write operation is performed on location x[k] at iteration k, and a read of the same memory location (as x[(k+1)-1]) occurs at iteration k+1.
 Similarly, y[k] is written at iteration k, and the same memory location is read (as y[(k+1)-1]) at iteration k+1
 The above statements cannot be parallelized
5.3 OpenMP: Sections Directive
 is a non-iterative work-sharing construct.
 It specifies that the enclosed section(s) of code are to be divided among the
threads in the team.
 Independent SECTION directives are nested within a SECTIONS directive. Each
SECTION is executed once by a thread in the team
Syntax

#pragma omp sections [clauses]


{
# pragma omp section { structured block }
# pragma omp section { structured block }
...
}
where clause can be
private (list)
firstprivate (list)
lastprivate (list)
reduction (operator: list), nowait

 The sections are divided among the threads
 Therefore each section is executed exactly once, by one thread, in parallel with the other sections
 If the program contains more sections than threads, the remaining sections are assigned as threads finish their previously assigned sections
 No schedule clause is defined for sections; the OpenMP implementation decides how, when, and in what order the threads execute the sections.
Restrictions:
 It is illegal to branch into or out of section blocks.
5.4 OpenMP: Single Directive
 specifies that the enclosed code is to be executed by only one thread in the team.
 useful for sections of code that are not thread-safe (such as I/O)
Syntax

#pragma omp single { structured_block }

 Threads in the team that do not execute the SINGLE directive, wait at the end of the
enclosed code block, unless a NOWAIT/nowait clause is specified.
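
A small sketch of single used for non-thread-safe work such as I/O inside a parallel region (read_input and compute are hypothetical helpers):

#include <omp.h>

void read_input(double *buf);        /* assumed: not thread-safe */
void compute(double *buf, int tid);  /* assumed: thread-safe */

void work(double *buf)
{
    #pragma omp parallel
    {
        #pragma omp single           /* exactly one thread reads the input; the implicit */
        read_input(buf);             /* barrier at the end of single makes it visible to all */

        compute(buf, omp_get_thread_num());
    }
}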
5.5 Synchronization Directive
 OpenMP has the following constructs to support synchronization
1. Barrier
2. Single
3. critical section
4. Master
5.5.1 Barrier
 synchronizes all threads in the team (Or) Each thread waits until all threads
arrive
 All threads in a team (or none) must execute the BARRIER region.
 Syntax :

#pragma omp barrier

5.5.2. Single
 specifies that the enclosed code is to be executed by only one thread in the team; see Section 5.4 for the syntax and details (threads that do not execute the SINGLE block wait at its end unless nowait is specified).

5.5.3. Critical Section

 specifies a region of code that must be executed by only one thread at a time
(i.e., Only one thread at a time can enter a critical section)
 If a thread is currently executing inside a CRITICAL region and another thread
reaches that CRITICAL region and attempts to execute it, it will block until the
first thread exits that CRITICAL region
Syntax

#pragma omp critical

 E.g.,
float res;
int i;
#pragma omp parallel
{
#pragma omp for
for (i = 0; i < niters; i++)
{
float B = big_job(i);
#pragma omp critical
consume(B, res);   /* only one thread at a time updates res */
}
}
5.5.4. MASTER Directive
 Specifies a region that is to be executed only by the master thread of the team. All
other threads on the team skip this section of code
 Syntax : #pragma omp master {structured_block}
Example
#pragma omp parallel
{
#pragma omp for nowait              // this loop is divided among the threads
for ( k = 0; k < 100; k++ ) x[k] = f1(k);
#pragma omp master
y = f();                            // only the master thread calls this
#pragma omp barrier                 // wait until y is ready
#pragma omp for nowait              // again, this loop is divided among the threads
for ( k = 0; k < 100; k++ ) x[k] = y + f2(x[k]);
#pragma omp single
fn_single_print(y);                 // exactly one thread (not necessarily the master) prints y
#pragma omp master
fn_print_array(x);                  // only the master thread prints x[]
}

VI. OpenMP : Environment Variables

OMP_SCHEDULE
Description: Modifies the behavior of the schedule clause when schedule(runtime) is specified in a for or parallel for directive.
Default: static, no chunk size specified
Syntax: export OMP_SCHEDULE=schedule-type[,chunk-size]
Example: export OMP_SCHEDULE="dynamic" or export OMP_SCHEDULE="dynamic,16"

OMP_NUM_THREADS
Description: Sets the default number of threads to use during execution, unless that number is explicitly changed by calling the omp_set_num_threads library routine or by an explicit num_threads clause on a parallel directive.
Default: number of processors
Syntax: export OMP_NUM_THREADS=value[,value]*
Example: export OMP_NUM_THREADS=value

OMP_DYNAMIC
Description: Specifies whether the OpenMP run time can adjust the number of threads in a parallel region.
Default: FALSE
Syntax: export OMP_DYNAMIC=value
Example: export OMP_DYNAMIC=true

OMP_NESTED
Description: Enables (true) or disables (false) nested parallelism, i.e., whether nested parallel regions create new teams of threads.
Default: FALSE
Syntax: export OMP_NESTED=value
Example: export OMP_NESTED=false

OMP_THREAD_LIMIT
Description: Sets the maximum number of OpenMP threads to use for the whole OpenMP program. The value of this environment variable must be a positive integer.
Default: number of available processors
Syntax: export OMP_THREAD_LIMIT=value
Example: export OMP_THREAD_LIMIT=8
VII. OpenMP :Run-Time Library Functions


 User-callable functions are available to the OpenMP C/C++ programmer to query
and alter the parallel execution environment.
 Any program unit that invokes these functions should include the statement

#include <omp.h>

 These funcs. affect and monitor threads, processors, and the parallel environment
1. omp_set_num_threads 5. omp_get_num_procs
2. omp_get_num_threads 6. omp_in_parallel
3. omp_get_max_threads 7. omp_set_dynamic
4. omp_get_thread_num 8. omp_get_dynamic
1. omp_set_num_threads
 sets the default number of threads to use for subsequent parallel regions that do
not specify a num_threads clause
Syntax :

void omp_set_num_threads(int num_threads);

2. omp_get_num_threads
 returns the number of threads in the team executing the parallel region from which
it is called.
 When called from a serial region, this function returns 1.
 By default, the value returned by this function is equal to the value of the
environment variable OMP_NUM_THREADS or to the value set by the last previous
call to the omp_set_num_threads() function
Syntax :

int omp_get_num_threads(void);

3. omp_get_max_threads
 returns the maximum number of threads that can be used for a new team if a parallel region without a num_threads clause is encountered (an upper bound, not necessarily the size of the current team).
Syntax

int omp_get_max_threads(void);

4. omp_get_thread_num
 returns the ID of a thread.
 The thread number lies between 0 and omp_get_num_threads()–1.
 The thread with the ID of 0 is the master thread
Syntax:

int omp_get_thread_num(void);

Example
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main ()
{
int nthreads, maxt;
omp_set_num_threads(3); // tell OpenMP to use 3 threads in parallel regions
#pragma omp parallel // Start parallel region
{
int tid = omp_get_thread_num(); // thread number, private to each thread
if (tid == 0)
{
printf("Thread %d getting environment info...\n", tid);
nthreads = omp_get_num_threads();
maxt = omp_get_max_threads();
printf("Number of threads = %d\n", nthreads);
printf("Max threads = %d\n", maxt);
}
}
return 0;
}

Output:
Thread 0 getting environment info...
Number of threads = 3
Max threads = 3

5. omp_get_num_procs
 returns the number of processors that are available to the program at the time the
function is called.
Syntax: int omp_get_num_procs(void);

6. omp_in_parallel
 called to determine whether the section of code that is executing is parallel or not.
 returns a nonzero value if called from within an active parallel region, and 0 otherwise
Syntax: int omp_in_parallel(void);

7. omp_set_dynamic
 enables or disables dynamic adjustment of the number of threads available for
execution of parallel regions.
Syntax: void omp_set_dynamic(int dynamic_threads);

 dynamic_threads = nonzero value => the number of threads that are used for
executing subsequent parallel regions may be adjusted automatically by the run-
time environment to best utilize system resources.
 dynamic_threads = 0, dynamic adjustment is disabled.
8. omp_get_dynamic
 returns a nonzero value if dynamic adjustment of threads is enabled, and returns
0 otherwise
 Syntax : int omp_get_dynamic(void);

Example
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main ()
{
int procs, inpar, dynamic;
#pragma omp parallel /* Start parallel region */
{
int tid = omp_get_thread_num(); /* Obtain thread number */
if (tid == 0) /* Only master thread does this */
{
printf("Thread %d getting environment info...\n", tid);
/* Get environment information */
procs = omp_get_num_procs();
inpar = omp_in_parallel();
dynamic = omp_get_dynamic();
/* Print environment information */
printf("Number of processors = %d\n", procs);
printf("In parallel? = %d\n", inpar);
printf("Dynamic threads enabled? = %d\n", dynamic);
}
}
return 0;
}

Output:
Thread 0 getting environment info...
Number of processors = 3
In parallel? = 1
Dynamic threads enabled? = 0

VIII. OpenMP : Handling Data and Functional Parallelism

8.1 Data Parallelism (refer 5.2 and 5.5.2)


Note: Write about the for work-sharing construct and the single work-sharing construct.
8.2 Functional Parallelism
 OpenMP allows us to assign different threads to different portions of code
 Consider the example
v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
printf("%6.2f\n", epsilon(x, y));
Dependency diagram for the given example
 alpha, beta, and delta can be computed in parallel
 gamma and epsilon cannot be computed in parallel with the functions that produce their inputs
 To implement functional parallelism, OpenMP uses the sections construct
 The above code, rewritten using the sections construct to achieve functional parallelism:

Solution -1 (three threads needed)

#pragma omp parallel sections


{
#pragma omp section /* This pragma optional */
v = alpha();
#pragma omp section
w = beta();
#pragma omp section
y = delta();
}
x = gamma(v, w);
printf("%6.2f\n", epsilon(x, y));

 Another way to write above program

Solution-2(only Two threads needed)

#pragma omp parallel


{
#pragma omp sections
{

#pragma omp section /* This pragma optional */
v = alpha();
#pragma omp section
w = beta();
}
#pragma omp sections
{
#pragma omp section /* This pragma optional */
x = gamma(v, w);
#pragma omp section
y = delta();
}
}
printf("%6.2f\n", epsilon(x, y));

IX. Handling Loops (refer to Section 5.2 in full)

X. OpenMP : Performance Considerations


General techniques for improving performance of OpenMP applications.
10.1 Minimize synchronization.
 Avoid or minimize the use of BARRIER, CRITICAL sections, ORDERED regions,
and locks.
 Use the NOWAIT clause where possible to eliminate redundant or unnecessary barriers (see the sketch after this list).
 Use explicit FLUSH with care. Flushes can cause data cache restores to memory,
and subsequent data accesses may require reloads from memory, all of which
decrease efficiency.
 If a SHARED variable in a parallel region is read by the threads executing the
region, but not written to by any of the threads, then specify that variable to be
FIRSTPRIVATE instead of SHARED. This avoids accessing the variable by
dereferencing a pointer, and avoids cache conflicts.
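
A possible sketch of eliminating a redundant barrier with nowait (the two loops touch different arrays, so threads need not wait between them; the names are illustrative):

#include <omp.h>

void update(double a[], double b[], int n)
{
    int i;
    #pragma omp parallel
    {
        #pragma omp for nowait       /* no barrier at the end of this loop */
        for (i = 0; i < n; i++)
            a[i] = 2.0 * a[i];

        #pragma omp for              /* threads start here as soon as they finish above */
        for (i = 0; i < n; i++)
            b[i] = b[i] + 1.0;
    }                                /* single implicit barrier at the end of the region */
}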
10.2 Inverting loops
Transforming a sequential for loop into a parallel for loop can actually increase a program's execution time.
There are three ways of improving the performance of parallel loops
i) Inverting loops
Consider the following code segment

for (i = 1; i < m; i++)
for (j = 0; j < n; j++)
a[i][j] = 2 * a[i-1][j];

data dependence diagram of array “a” in this code.

 two rows may not be updated simultaneously, because there are data
dependences between rows, but the columns may be updated simultaneously.
 That is the loop indexed by j may be executed in parallel, but not the loop indexed by
i.
 If we insert a parallel for pragma before the inner loop, the resulting parallel program will execute correctly, but it requires m-1 fork/join steps, one per iteration of the outer loop.
 However, if we invert the loops, only a single fork/join step is required.

#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
for (i = 1; i < m; i++)
a[i][j] = 2 * a[i-1][j];

 the data dependences have not changed, the iterations of the loop indexed by j are
still independent of each other.
10.3 Scheduling loops (refer 5.2.1)
Note : Refer Scheduling concept in for constructs
10.4 Conditionally executing loops
 If a loop does not have enough iterations, the time spent forking and joining threads
may exceed the time saved by dividing the loop iterations among multiple threads.
 for example
area = 0.0;
#pragma omp parallel for reduction(+:area)
for (i = 0; i < n; i++)
area += i;
avg = area / n;
 The if clause gives us the ability to direct the compiler to insert code that determines
at run-time whether the loop should be executed in parallel or sequentially.
 The clause has this syntax: if(scalar expression)

 If the scalar expression evaluates to true, the loop will be executed in parallel.
Otherwise, it will be executed serially.
 Adding an if clause to the parallel for pragma in the program computing area:
#pragma omp parallel for reduction(+:area) if(n > 5000)
for (i = 0; i < n; i++)
area += i;
avg = area / n;
The loop iterations above will be divided among multiple threads only if n > 5000.
