
Programming Shared-memory Platforms with OpenMP

John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu

COMP 422

Lecture 7, 1 February 2011

Topics for Today



- Introduction to OpenMP
- OpenMP directives
  - concurrency directives: parallel regions; loops, sections, tasks
  - synchronization directives: reductions, barrier, critical, ordered
  - data handling clauses: shared, private, firstprivate, lastprivate
  - tasks
- Performance tuning hints
- Library primitives
- Environment variables

2

What is OpenMP?
- Open specifications for Multi Processing
- An API for explicit multi-threaded, shared-memory parallelism
- Three components
  - compiler directives
  - runtime library routines
  - environment variables
- Higher-level programming model than Pthreads
  - implicit mapping and load balancing of work
- Portable
  - API is specified for C/C++ and Fortran
  - implementations exist on almost all platforms
- Standardized

OpenMP at a Glance

[Layered diagram: the user's application sits on the OpenMP compiler and
runtime library, which sit on OS threads (e.g., Pthreads); the user also
influences behavior through environment variables.]

OpenMP Is Not

- An automatic parallel programming model
  - parallelism is explicit; the programmer has full control (and responsibility) over parallelization
- Meant for distributed-memory parallel systems (by itself)
  - designed for shared address space machines
- Necessarily implemented identically by all vendors
- Guaranteed to make the most efficient use of shared memory
  - no data locality control

OpenMP Targets Ease of Use



- OpenMP does not require that single-threaded code be changed for threading
  - enables incremental parallelization of a serial program
- OpenMP only adds compiler directives
  - pragmas (C/C++); significant comments in Fortran
  - if a compiler does not recognize a directive, it simply ignores it
- Simple and limited set of directives for shared-memory programs
  - significant parallelism is possible using just 3 or 4 directives
  - both coarse-grain and fine-grain parallelism
- If OpenMP is disabled when compiling a program, the program executes
  sequentially (a minimal sketch follows below)
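
As an illustrative sketch (not from the slides), the program below compiles
and runs the same whether or not OpenMP is enabled; without OpenMP the pragma
is simply ignored and the loop runs serially. The array and its size are
hypothetical; _OPENMP is the standard macro defined by OpenMP compilers.

#include <stdio.h>

int main(void)
{
    double a[1000];
    int i;

    /* without OpenMP support this pragma is ignored and the loop runs serially */
    #pragma omp parallel for
    for (i = 0; i < 1000; i++)
        a[i] = i * 2.0;

#ifdef _OPENMP
    printf("compiled with OpenMP support\n");
#endif
    printf("a[999] = %f\n", a[999]);
    return 0;
}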

OpenMP: Fork-Join Parallelism



- An OpenMP program begins execution as a single master thread
- The master thread executes sequentially until the first parallel region
- When a parallel region is encountered, the master thread
  - creates a group of threads
  - becomes the master of this group of threads
  - is assigned thread id 0 within the group

[Figure: repeated fork/join phases alternating with serial execution;
master thread shown in red]

OpenMP Directive Format

- OpenMP directive forms
  - C and C++ use compiler directives
    - prefix: #pragma omp
  - Fortran uses significant comments
    - prefixes: !$omp, c$omp, *$omp
- A directive consists of a directive name followed by clauses
  - C:       #pragma omp parallel default(shared) private(beta,pi)
  - Fortran: !$omp parallel default(shared) private(beta,pi)

OpenMP parallel Region Directive


#pragma omp parallel [clause list]

Typical clauses in [clause list]

- Conditional parallelization
  - if (scalar expression): determines whether the parallel construct creates threads
- Degree of concurrency
  - num_threads(integer expression): number of threads to create
- Data scoping
  - private (variable list): specifies variables local to each thread
  - firstprivate (variable list): similar to private; the private copies are
    initialized to the variable's value before the parallel directive
  - shared (variable list): specifies that variables are shared across all the threads
  - default (data scoping specifier): the default data scoping specifier; may be shared or none
- A few more clauses are summarized on slide 37

9

Interpreting an OpenMP Parallel Directive


#pragma omp parallel if (is_parallel==1) num_threads(8) \
    shared(b) private(a) firstprivate(c) default(none)
{
  /* structured block */
}

Meaning

- if (is_parallel==1) num_threads(8)
  - if the value of the variable is_parallel is one, create 8 threads
- shared(b)
  - each thread shares a single copy of variable b
- private(a) firstprivate(c)
  - each thread gets private copies of variables a and c
  - each private copy of c is initialized with the value of c in the main
    thread when the parallel directive is encountered
- default(none)
  - the default state of a variable is specified as none (rather than shared)
  - signals an error if not all variables are specified as shared or private

10

Meaning of OpenMP Parallel Directive

[Figure: an OpenMP parallel region and its hand-coded Pthreads equivalent,
shown side by side; a rough sketch of the Pthreads side follows below]

11
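
The following sketch (not from the slides; names and team size are
hypothetical, and real compilers generate more elaborate code) suggests what
a Pthreads equivalent of a parallel region looks like:

#include <pthread.h>
#define NUM_THREADS 8

/* body of the parallel region, outlined into a function */
void *parallel_region_body(void *arg)
{
    /* ... structured block executed by every thread ... */
    return NULL;
}

void run_parallel_region(void)
{
    pthread_t workers[NUM_THREADS - 1];
    int i;

    /* fork: the master creates the other members of the team */
    for (i = 0; i < NUM_THREADS - 1; i++)
        pthread_create(&workers[i], NULL, parallel_region_body, NULL);

    parallel_region_body(NULL);     /* the master executes the region too */

    /* join: the master waits for the team before continuing serially */
    for (i = 0; i < NUM_THREADS - 1; i++)
        pthread_join(workers[i], NULL);
}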

Specifying Worksharing
Within the scope of a parallel directive, worksharing directives allow
concurrency between iterations or tasks

OpenMP provides two worksharing directives
- DO/for: concurrent loop iterations
- sections: concurrent tasks

12

Worksharing DO/for Directive


- The for directive partitions parallel iterations across threads
  - DO is the analogous directive for Fortran
- Usage:
    #pragma omp for [clause list]
    /* for loop */
- Possible clauses in [clause list]
  - private, firstprivate, lastprivate
  - reduction
  - schedule, nowait, and ordered
- Implicit barrier at the end of the for loop

13

A Simple Example Using parallel and for


Program:

void main() {
  #pragma omp parallel num_threads(3)
  {
    int i;
    printf("Hello world\n");
    #pragma omp for
    for (i = 1; i <= 4; i++) {
      printf("Iteration %d\n", i);
    }
    printf("Goodbye world\n");
  }
}

Output:

Hello world
Hello world
Hello world
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Goodbye world
Goodbye world
Goodbye world

14

Reduction Clause for Parallel Directive


- Specifies how to combine local copies of a variable in different threads
  into a single copy at the master when threads exit
- Usage: reduction (operator: variable list)
  - variables in the list are implicitly private to threads
- Reduction operators: +, *, -, &, |, ^, &&, and ||
- Usage sketch:

#pragma omp parallel reduction(+: sum) num_threads(8)
{
  /* compute local sum in each thread here */
}
/* sum here contains the sum of all local instances of sum */

15

OpenMP Reduction Clause Example


OpenMP threaded program to estimate pi:

#pragma omp parallel default(private) shared(npoints) \
                     reduction(+: sum) num_threads(8)
{
  num_threads = omp_get_num_threads();
  /* here, the user manually divides the work */
  sample_points_per_thread = npoints / num_threads;
  sum = 0;   /* a local copy of sum for each thread */
  for (i = 0; i < sample_points_per_thread; i++) {
    coord_x = (double)(rand_r(&seed))/(double)((2<<14)-1) - 0.5;
    coord_y = (double)(rand_r(&seed))/(double)((2<<14)-1) - 0.5;
    if ((coord_x * coord_x + coord_y * coord_y) < 0.25)
      sum++;
  }
}
/* all local copies of sum are added together and stored in the master */

16

Using Worksharing for Directive


#pragma omp parallel default(private) shared(npoints) \
                     reduction(+: sum) num_threads(8)
{
  sum = 0;
  /* the worksharing for divides the work */
  #pragma omp for
  for (i = 0; i < npoints; i++) {
    rand_no_x = (double)(rand_r(&seed))/(double)((2<<14)-1);
    rand_no_y = (double)(rand_r(&seed))/(double)((2<<14)-1);
    if (((rand_no_x - 0.5) * (rand_no_x - 0.5) +
         (rand_no_y - 0.5) * (rand_no_y - 0.5)) < 0.25)
      sum++;
  }
  /* implicit barrier at the end of the loop */
}

17

Mapping Iterations to Threads


- schedule clause of the for directive
  - a recipe for mapping iterations to threads
  - usage: schedule(scheduling_class[, parameter])
- Four scheduling classes (see the sketch after this list)
  - static: work partitioned at compile time
    - iterations statically divided into pieces of size chunk and
      statically assigned to threads
  - dynamic: work evenly partitioned at run time
    - iterations are divided into pieces of size chunk
    - chunks are dynamically scheduled among the threads
    - when a thread finishes one chunk, it is dynamically assigned another
    - default chunk size is 1
  - guided: guided self-scheduling
    - chunk size is exponentially reduced with each dispatched piece of work
    - the default minimum chunk size is 1
  - runtime:
    - scheduling decision taken at run time from the environment variable OMP_SCHEDULE
    - illegal to specify a chunk size for this clause

18
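
For example (a sketch, not from the slides; process_item and n are
illustrative), a loop whose iterations have unpredictable cost might use
dynamic or runtime scheduling:

/* chunks of 4 iterations are handed to threads as they become idle */
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
  process_item(i);      /* hypothetical routine with variable cost */

/* schedule chosen at run time from the OMP_SCHEDULE environment variable */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++)
  process_item(i);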

Statically Mapping Iterations to Threads


/* static scheduling of matrix multiplication loops */
#pragma omp parallel default(private) \
        shared(a, b, c, dim) num_threads(4)
#pragma omp for schedule(static)
for (i = 0; i < dim; i++) {
  for (j = 0; j < dim; j++) {
    c(i,j) = 0;
    for (k = 0; k < dim; k++) {
      c(i,j) += a(i, k) * b(k, j);
    }
  }
}

The static schedule maps iterations to threads at compile time


19

Avoiding Unwanted Synchronization



- Default: worksharing for loops end with an implicit barrier
- Often, less synchronization is appropriate
  - e.g., a series of independent for directives within a parallel construct
- nowait clause
  - modifies a for directive
  - avoids the implicit barrier at the end of the for

20

Avoiding Synchronization with nowait


#pragma omp parallel
{
  #pragma omp for nowait
  for (i = 0; i < nmax; i++)
    if (isEqual(name, current_list[i]))
      processCurrentName(name);

  #pragma omp for
  for (i = 0; i < mmax; i++)
    if (isEqual(name, past_list[i]))
      processPastName(name);
}

With nowait, any thread can begin the second loop immediately, without
waiting for the other threads to finish the first loop
21

Worksharing sections Directive


- The sections directive enables specification of task parallelism
- Usage:

#pragma omp sections [clause list]
{
  [#pragma omp section
     /* structured block */ ]
  [#pragma omp section
     /* structured block */ ]
  ...
}

(the brackets indicate that each section is optional; they are not part of
the syntax)
22

Using the sections Directive


A parallel region encloses all the parallel work; sections expresses task
parallelism, and the three concurrent tasks need not be procedure calls:

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    { taskA(); }
    #pragma omp section
    { taskB(); }
    #pragma omp section
    { taskC(); }
  }
}
23

Nesting parallel Directives



- Nested parallelism is enabled using the OMP_NESTED environment variable
  - OMP_NESTED = TRUE: nested parallelism is enabled (a sketch follows below)
- Each parallel directive creates a new team of threads

[Figure: an outer fork/join whose threads each perform their own inner
fork/join; master thread shown in red]
24
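
A minimal sketch (not from the slides) of a nested parallel region, assuming
nesting is enabled either via OMP_NESTED=TRUE or via omp_set_nested(1):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1);                        /* same effect as OMP_NESTED=TRUE */
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)   /* each outer thread forks its own team */
        {
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}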

Synchronization Constructs in OpenMP


- #pragma omp barrier
  - wait until all threads arrive here
- #pragma omp single [clause list]
    structured block
- #pragma omp master
    structured block
  - both single and master give single-threaded execution of the block
    (see the sketch after this list)

Use MASTER instead of SINGLE wherever possible
- MASTER = an IF statement with no implicit BARRIER
  - equivalent to if (omp_get_thread_num() == 0) {...}
- SINGLE is implemented like other worksharing constructs
  - keeping track of which thread reached the SINGLE first adds overhead

25
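
A short sketch (not from the slides; do_work, print_progress, read_next_input,
and do_more_work are hypothetical) contrasting the two constructs:

#pragma omp parallel
{
    do_work();                /* executed by every thread */

    #pragma omp master        /* only thread 0; no implied barrier */
    print_progress();

    #pragma omp single        /* exactly one (arbitrary) thread; implied barrier at the end */
    read_next_input();

    do_more_work();           /* all threads wait at the single's barrier before this */
}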

Synchronization Constructs in OpenMP

- #pragma omp critical [(name)]
    structured block
  - critical section: like a named lock
- #pragma omp ordered
    structured block
  - for loops with carried dependences

26

Example Using critical


#pragma omp parallel
{
  #pragma omp for nowait shared(best_cost)
  for (i = 0; i < nmax; i++) {
    int my_cost;
    /* critical ensures mutual exclusion when accessing shared state */
    #pragma omp critical
    {
      if (best_cost < my_cost)
        best_cost = my_cost;
    }
  }
}

27

Example Using ordered


#pragma omp parallel
{
  #pragma omp for nowait shared(a)
  for (k = 0; k < nmax; k++) {
    #pragma omp ordered
    {
      a[k] = a[k-1] + ...;
    }
  }
}

ordered ensures that the carried dependence does not cause a data race

28

Orphaned Directives

- Directives need not be lexically nested within a parallel region
  - they may occur in a separate program unit

  ...
  !$omp parallel
  call phase1
  call phase2
  !$omp end parallel
  ...

  subroutine phase1
  !$omp do private(i) shared(n)
  do i = 1, n
    call some_work(i)
  end do
  !$omp end do
  end

  subroutine phase2
  !$omp do private(j) shared(n)
  do j = 1, n
    call more_work(j)
  end do
  !$omp end do
  end

- Orphaned directives dynamically bind to the enclosing parallel region at run time
- Benefits
  - enables parallelism to be added with a minimum of restructuring
  - improves performance: enables a single parallel region to bind with
    worksharing constructs in multiple called routines
- Execution rules
  - an orphaned worksharing construct is executed serially when not called
    from within a parallel region
29

OpenMP 3.0 Tasks

- Motivation: support parallelization of irregular problems
  - unbounded loops
  - recursive algorithms
  - producer/consumer patterns
- What is a task?
  - a unit of work whose execution can begin immediately or be deferred
  - components of a task: code to execute, data environment, internal control variables
- Task execution
  - the data environment is constructed at creation
  - tasks are executed by the threads of a team
  - a task can be tied to a thread (i.e., migration/stealing not allowed)
    - by default, a task is tied to the first thread that executes it

30

OpenMP 3.0 Tasks


#pragma omp task [clause list]

Possible clauses in [clause list]

- Conditional parallelization
  - if (scalar expression): determines whether the construct creates a task
- Binding to threads
  - untied
- Data scoping
  - private (variable list): specifies variables local to the child task
  - firstprivate (variable list): similar to private; the private copies are
    initialized to the value in the parent task before the directive
  - shared (variable list): specifies that variables are shared with the parent task
  - default (data handling specifier): may be shared or none

31

Composing Tasks and Regions


#pragma omp parallel
{
  #pragma omp task
  x();                 /* one x task created for each thread in the parallel region */

  #pragma omp barrier  /* all x tasks complete at the barrier */

  #pragma omp single
  {
    #pragma omp task
    y();               /* one y task created */
  }                    /* region end: the y task completes */
}


32

Data Scoping for Tasks is Tricky


If no default clause is specified:
- static and global variables are shared
- automatic (local) variables are private
- variables for orphaned tasks are firstprivate by default
- variables for non-orphaned tasks inherit the shared attribute
  - task variables are firstprivate unless shared in the enclosing context

33

Fibonacci using (Orphaned) OpenMP 3.0 Tasks


int fib(int n)
{
  int x, y;
  if (n < 2) return n;
  #pragma omp task shared(x)   /* need shared for x and y;          */
  x = fib(n - 1);              /* the default would be firstprivate */
  #pragma omp task shared(y)
  y = fib(n - 2);
  #pragma omp taskwait         /* suspend parent task until children finish */
  return x + y;
}

int main(int argc, char **argv)
{
  int n, result;
  n = atoi(argv[1]);
  #pragma omp parallel         /* create a team of threads to execute tasks */
  {
    #pragma omp single         /* only one thread performs the outermost call */
    {
      result = fib(n);
    }
  }
  printf("fib(%d) = %d\n", n, result);
}


34

List Traversal
Element *first, *e;
#pragma omp parallel
#pragma omp single
{
  for (e = first; e; e = e->next)
    #pragma omp task firstprivate(e)
      process(e);
}

Is the use of variables safe as written?

35

Task Scheduling

- Tied tasks
  - only the thread that the task is tied to may execute it
  - a task can only be suspended at a suspension point:
    - task creation
    - task finish
    - taskwait
    - barrier
  - if a task is not suspended at a barrier, it can only switch to a
    descendant of any task tied to the thread
- Untied tasks (see the sketch after this list)
  - no scheduling restrictions
    - can suspend at any point
    - can switch to any task
  - the implementation may schedule for locality and/or load balance


36
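
For illustration (a sketch, not from the slides; traverse and root are
hypothetical), the untied clause is simply added to the task directive:

#pragma omp task untied
{
  /* a long-running task; if it suspends, any thread in the team
     may resume it (no binding to the creating thread) */
  traverse(root);
}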

Summary of Clause Applicability

[Table (not reproduced here): which clauses may be used with which directives]

37

Performance Tuning Hints


Parallelize at the highest level, e.g., the outermost DO/for loops

Slower (worksharing DO inside the serial j loop):

!$OMP PARALLEL
....
do j = 1, 20000
!$OMP DO
  do k = 1, 10000
    ...
  enddo !k
!$OMP END DO
enddo !j
...
!$OMP END PARALLEL

Faster (worksharing DO on the outermost loop):

!$OMP PARALLEL
....
!$OMP DO
do k = 1, 10000
  do j = 1, 20000
    ...
  enddo !j
enddo !k
!$OMP END DO
...
!$OMP END PARALLEL

38

Performance Tuning Hints


Merge independent parallel loops when possible

Slower (two worksharing constructs, two implicit barriers):

!$OMP PARALLEL
....
!$OMP DO
  statement 1
!$OMP END DO
!$OMP DO
  statement 2
!$OMP END DO
....
!$OMP END PARALLEL

Faster (one worksharing construct):

!$OMP PARALLEL
....
!$OMP DO
  statement 1
  statement 2
!$OMP END DO
....
!$OMP END PARALLEL

39

Performance Tuning Hints


- Minimize the use of synchronization
  - BARRIER
  - CRITICAL sections
    - if necessary, use named CRITICAL sections for fine-grained locking
      (see the sketch after this list)
  - ORDERED regions
- Use the NOWAIT clause to avoid unnecessary barriers
  - adding NOWAIT to a region's final DO eliminates a redundant barrier
- Use explicit FLUSH with care
  - flushes can evict cached values
  - subsequent data accesses may require reloads from memory

40
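
As a sketch of the named-CRITICAL hint above (not from the slides; the
variables are illustrative), two unrelated updates can use different names so
they do not serialize against each other:

#pragma omp critical (update_histogram)
histogram[bin]++;            /* hypothetical shared counter */

#pragma omp critical (update_total)
total += local_sum;          /* protected by a different, independent lock */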

OpenMP Library Functions

- Processor count
  - int omp_get_num_procs();     /* # processors (PEs) currently available */
  - int omp_in_parallel();       /* determine whether running in parallel */
- Thread count and identity (usage sketch below)
  - void omp_set_num_threads(int num_threads);
    /* set max # threads for the next parallel region; only call in a serial region */
  - int omp_get_num_threads();   /* # threads currently active */

41
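
A small usage sketch of these routines (not from the slides; output varies by
platform and OpenMP implementation):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("%d processors available\n", omp_get_num_procs());
    omp_set_num_threads(4);            /* called from a serial region */

    #pragma omp parallel
    {
        #pragma omp master
        printf("in parallel? %d, team size %d\n",
               omp_in_parallel(), omp_get_num_threads());
    }
    return 0;
}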

OpenMP Library Functions

- Controlling and monitoring dynamic thread creation and nesting
  - void omp_set_dynamic(int dynamic_threads);
  - int omp_get_dynamic();
  - void omp_set_nested(int nested);
  - int omp_get_nested();
- Mutual exclusion (usage sketch below)
  - void omp_init_lock(omp_lock_t *lock);
  - void omp_destroy_lock(omp_lock_t *lock);
  - void omp_set_lock(omp_lock_t *lock);
  - void omp_unset_lock(omp_lock_t *lock);
  - int omp_test_lock(omp_lock_t *lock);
  - the lock routines have nested-lock counterparts for recursive mutexes

42
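
A minimal sketch of the lock routines (not from the slides; the protected
counter is illustrative):

#include <omp.h>

omp_lock_t lock;
int counter = 0;

void increment(void)
{
    omp_set_lock(&lock);      /* blocks until the lock is acquired */
    counter++;
    omp_unset_lock(&lock);
}

int main(void)
{
    omp_init_lock(&lock);
    #pragma omp parallel num_threads(8)
    increment();
    omp_destroy_lock(&lock);
    return 0;
}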

OpenMP Environment Variables



- OMP_NUM_THREADS
  - specifies the default number of threads for a parallel region
- OMP_DYNAMIC
  - specifies whether the number of threads can be dynamically changed
- OMP_NESTED
  - enables nested parallelism (support may be nominal: one thread)
- OMP_SCHEDULE
  - specifies the scheduling of for loops whose schedule clause specifies runtime

New in OpenMP 3.0
- OMP_STACKSIZE (stack size for non-master threads)
- OMP_WAIT_POLICY (ACTIVE or PASSIVE)
- OMP_MAX_ACTIVE_LEVELS
  - integer value for the maximum # of nested parallel regions
- OMP_THREAD_LIMIT (# threads for the entire program)
43

OpenMP Directives vs. Pthreads

- Directive advantages
  - directives facilitate a variety of thread-related tasks
  - frees the programmer from
    - initializing attribute objects
    - setting up thread arguments
    - partitioning iteration spaces
- Directive disadvantages
  - data exchange is less apparent
    - leads to mysterious overheads: data movement, false sharing, and contention
  - the API is less expressive than Pthreads
    - lacks condition waits, locks of different types, and flexibility for
      building composite synchronization operations

44

The Future of OpenMP



The OpenMP 3.1 standard is being finalized; the OpenMP 4.0 standardization
process has begun. Topics under discussion include:

- support for heterogeneous systems (GPUs etc.)
- tasking model refinement
- locality and affinity
- thread team control
- transactional memory and thread-level speculation
- additional synchronization mechanisms
- an OpenMP error model
- interoperability and composability
- tools support in the specification
- consideration of Fortran 2003 and other language bindings (Java, Python)
45

References

Blaise Barney. LLNL OpenMP Tutorial. http://www.llnl.gov/computing/tutorials/openMP

Adapted from the slides "Programming Shared Address Space Platforms" by Ananth Grama.

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to
Parallel Computing, Chapter 7. Addison Wesley, 2003.

Sun Microsystems. OpenMP API User's Guide, Chapter 7: Performance
Considerations. http://docs.sun.com/source/819-3694/7_tuning.html

Alberto Duran. OpenMP 3.0: What's New?. IWOMP 2008.
http://cobweb.ecn.purdue.edu/ParaMount/iwomp2008/documents/omp30

Stephen Blair-Chappell. Expressing Parallelism Using the Intel Compiler.
http://www.polyhedron.com/web_images/documents/Expressing%20Parallelism%20Using%20Intel%20Compiler.pdf

Rusty Lusk et al. Programming Models and Runtime Systems. Exascale Software
Center Meeting, ANL, Jan. 2011.

46
