DS3202:

PARALLEL PROGRAMMING
MODULE – 3
Advanced OpenMP

6th SEM
B.Tech
DSE

Advanced OpenMP Constructs
Schedule clause
• Specifies how loop iterations are divided among the team of threads.
• The schedule refers to the way in which loop indices are distributed
among threads.
• The default schedule is implementation dependent.
• Supported scheduling types are:
• Static
• Dynamic
• Guided
• Runtime
• Auto
Why Schedule Clause?
• A parallel region has at least one barrier, at its end, and may have
additional barriers within it.
• At each barrier, the other members of the team must wait for the last
thread to arrive.
• To minimize this wait time, shared work should be distributed so that
all threads arrive at the barrier at about the same time.
• If some of that shared work is contained in for constructs, the
schedule clause can be used for this purpose.

Static Clause
• Static is the default in most implementations (the OpenMP default is
implementation defined).
• Loop iterations are divided into pieces of size chunk and then statically
assigned to threads.
• If chunk is not specified, the iterations are divided contiguously and as
evenly as possible among the threads.
• The size of a chunk, denoted as chunk_size, must be a positive integer.
• When a chunk size is given, chunks are assigned to threads in round-robin
order of thread number; this is known as block-cyclic scheduling.
• The number of indices assigned to each thread is as equal as possible.
• #pragma omp parallel for schedule(static, chunk_size)
for(i=0; i<8; i++)
{
… (loop body)
}
Dynamic Clause
• Loop iterations are divided into chunks containing n iterations each and
then dynamically assigned to threads.
• When a thread finishes one chunk, it is dynamically assigned another.
• The default chunk size is 1.
• Which iterations a thread picks up depends on the relative speeds of
thread execution.
• #pragma omp parallel for schedule(dynamic, 1) is equivalent to
#pragma omp parallel for schedule(dynamic)
• #pragma omp parallel for schedule(dynamic, chunk_size)
for(i=0; i<8; i++)
{
… (loop body)
}
Guided Clause
• With guided, the chunk size shrinks dynamically as the loop progresses.
• The size of a chunk is proportional to the number of unassigned iterations
divided by the number of threads, and the size decreases toward chunk_size
(though the last chunk may be smaller than chunk_size).
• chunk_size = max(remaining_iterations / (2 × num_threads), n)
• The exact formula may differ across compiler implementations.
• If you specify n, that is the minimum chunk size each thread should get.
• Each successive chunk is smaller than the last.
• The default chunk size is 1.
• #pragma omp parallel for schedule(guided, chunk_size)
for(i=0; i<8; i++)
{
… (loop body)
}
• The guided schedule is appropriate for the case in which the threads
may arrive at varying times at a for construct with each iteration
requiring about the same amount of work.
• This situation can happen if, for example, the for construct is
preceded by one or more sections or for constructs with nowait
clauses.

Runtime Clause
• The scheduling type is determined at run time from the OMP_SCHEDULE
environment variable, e.g.:
• export OMP_SCHEDULE="static,4"
• The schedule(runtime) clause defers the scheduling decision until run
time, when it is read from that environment variable.
• It is illegal to specify a chunk size with this clause.
• #pragma omp for schedule(runtime)
for(i=0; i<8; i++)
{
… (loop body)
}
Auto Clause
• With auto, scheduling is delegated to the compiler and runtime system.
• The compiler and runtime system can choose any possible mapping of
iterations to threads (including all possible valid schedules) and these
may be different in different loops.
• That is, it delegates the scheduling decision to the compiler and/or
runtime system.
• Thus, the schedule is chosen automatically by the implementation.
• #pragma omp parallel for schedule (auto)
for(i=0; i<8; i++)
{
… (loop body)
}
Flush Directive
• Flush operation does not actually synchronize different threads.
• It just ensures that a thread’s values are made consistent with main memory.
• Flush forces data to be updated in memory so other threads see the most
recent value.
• Thread-visible variables are written back to memory at this point.
• It prevents threads from observing stale or outdated values of variables.
• For pointers in the list, note that the pointer itself is flushed, not the object it
points to.

• However, processors have their own registers and caches.
• If a thread updates shared data, the new value may first be held in a
register and then stored in the local cache.
• The updates are thus not necessarily immediately visible to other
threads.
Usage of flush directive
• The flush directive is typically used in conjunction with shared variables that
are accessed by multiple threads within a parallel region.
• Placed before a memory read operation to ensure that the most recent
values of shared variables are observed.
• Placed after a memory write operation to ensure that the updated values
are visible to other threads.
• Syntax : #pragma omp flush(list)
• where list is a comma-separated list of variables whose updates need to be
synchronized across threads.
• Naming variables in the list restricts the flush to those variables,
avoiding a flush of all thread-visible variables.
• The FLUSH directive is implied by several other directives: at each
barrier, at entry to and exit from parallel, critical, and ordered
regions, and at exit from worksharing constructs (for, sections, single).
• The flush is not implied if a NOWAIT clause is present.
Memory ordering using flush directive
• The flush directive imposes a memory ordering constraint on the execution
of the program, ensuring that updates to variables are propagated in the
specified order.
• Prevents reordering of memory operations by the compiler or hardware,
which could lead to incorrect program behavior.

Example Code-Flush Directive

Explanation of code
• In this code, we have a shared variable shared_var. Thread 0 updates the
shared variable to the value 10.
• The flush directive after the update ensures that the new value is
written back to memory where other threads can see it.
• All threads then wait at the barrier (#pragma omp barrier), which
guarantees the update has completed before anyone reads it.
• Once past the barrier, both threads print the value of the shared
variable, which is 10 in this case.
Nested Parallelism
• OpenMP parallel regions can be nested inside each other.
• That means the ability to create parallel regions within other parallel regions.
• Nested parallelism enables a hierarchical structure of parallelism.
• Nested parallelism can potentially improve the utilization of computing resources.
• You need to turn on nested parallelism by setting the OMP_NESTED
environment variable or calling omp_set_nested(), because many
implementations turn this feature off by default.
• If nested parallelism is disabled, then the new team created by a thread
encountering a parallel construct inside a parallel region consists only of the
encountering thread.
• If nested parallelism is enabled, then the new team may consist of more than one
thread.
Creation of nested parallel regions
• Nested parallel regions are created by encountering additional
parallel constructs (#pragma omp parallel) within an existing parallel
region.
• When a thread encounters a nested parallel construct, it creates a
team of threads to execute the nested parallel region.
• The number of threads created for the nested parallel region may be
determined by the num_threads clause or the environment settings.

Example code 1– Nested Parallelism

Explanation of Code
• Each thread in the outer parallel region will print its thread ID.
• Then, each thread in the outer parallel region will enter the inner parallel
region.
• Within the inner parallel region, each thread will print its thread ID again.
• This demonstrates that each thread in the outer parallel region can
spawn threads to execute the inner parallel region.

Performance Considerations in Nested Parallelism

• Nested parallelism allows for more granular control over thread allocation
and resource utilization.
• The threads from outer parallel regions can be reused or allocated to
execute inner parallel regions.
• While nested parallelism can potentially improve performance by exploiting
additional parallelism, it can also introduce overhead due to thread
management and synchronization.
• Excessive nesting may lead to diminishing returns or increased overhead.

Thread private directive
• Used to declare variables that should have private instances for each thread
in a parallel region.
• Variables declared with the "thread private" directive have private instances
for each thread in a parallel region.
• Syntax : #pragma omp threadprivate(variable_list)
• where variable_list is a comma-separated list of variables that should
have private instances for each thread.

Scope and initialization
• Scope :
• The "thread private" directive typically appears outside of parallel regions, often at
the global or file scope.
• Specifies that the listed variables should be treated as thread-private for all
subsequent parallel regions.
• Initialization :
• Each thread's copy of a threadprivate variable is initialized from the
variable's original initial value.
• Assignments made by one thread (including the master thread before a
parallel region) are not seen by the other threads' copies unless the
copyin clause is used.
Usage and memory overhead
• Usage : Commonly used for global variables or variables declared at file
scope that need to be shared across multiple parallel regions but have
private instances for each thread.
• Memory Overhead :
• Using the "thread private" directive incurs memory overhead, as each thread
maintains its own private copy of the variable.
• Hence, it should be used judiciously, especially for large variables or in scenarios with
a large number of threads.

Example code of thread private
#include <stdio.h>
#include <omp.h>

// Global variable declared as threadprivate
// (the directive must follow the declaration)
int global_var = 0;
#pragma omp threadprivate(global_var)

int main() {
    // The master thread sets its own copy to its thread ID (0)
    global_var = omp_get_thread_num();

    // Parallel region with two threads
    #pragma omp parallel num_threads(2)
    {
        // Each thread prints its own thread-private copy of the variable
        printf("Thread %d: Global Variable = %d\n", omp_get_thread_num(),
               global_var);
    }
    return 0;
}
Code explanation
• The global variable global_var is marked as threadprivate, so each thread
has its own private instance of this variable.
• Before entering the parallel region, the master thread sets its own copy
of global_var to its thread ID (which is 0 in this case).
• Within the parallel region, there are two threads.
• Thread 0 prints its own private instance of global_var, which retains the
value set outside the parallel region (0).
• Thread 1 prints its own private instance, which still holds the initial
value 0: it never sees the master thread's assignment (the copyin clause
would be needed to broadcast it).
• Data in THREADPRIVATE objects is guaranteed to persist only if the
dynamic threads mechanism is "turned off" and the number of threads in
different parallel regions remains constant. The default setting of
dynamic threads is implementation defined.

• The THREADPRIVATE directive must appear after every declaration of
a thread private variable/common block.
Difference between private and thread private
• Scope :
• The "thread private" directive is typically used at the global or file scope and specifies
that the listed variables should have private instances for each thread in all subsequent
parallel regions.
• The "private" data scope attribute clause is used within parallel constructs (e.g.,
parallel, for, sections) to declare private variables and applies only within the specific
parallel region where it is used and affects only that region.
• Usage :
• Thread private directive is used for variables that need to be shared across multiple
parallel regions but have private instances for each thread.
• Private data scope attribute clause is used to declare private variables within a specific
parallel region.

Continued………

• Initialization :
• Variables declared with the "thread private" directive are initialized once
at the beginning of the program execution.
• Private variables declared within a parallel region are uninitialized at
the start of the region; use firstprivate to initialize them from the value
of the corresponding outer variable.
• Memory overhead :
• The thread private directive incurs memory overhead, as each thread
maintains its own private copy of the variable.
• Private variables declared within a parallel region do not incur memory
overhead outside of that region.
End of Module 3
(THANK YOU)
