Programming Shared-Memory Platforms with OpenMP
John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu
COMP 422
Topics for today:
  synchronization directives: reductions, barrier, critical, ordered
  tasks
What is OpenMP?
Open specifications for Multi Processing
Portable
  implementations on almost all platforms
Standardized
  API is specified for C/C++ and Fortran
OpenMP at a Glance
OpenMP Is Not
An automatic parallel programming model
parallelism is explicit: the programmer has full control of (and responsibility for) parallelization
Necessarily implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
  no data locality control
A simple & limited set of directives suffices for shared-memory programs; significant parallelism is possible using just 3 or 4 directives
both coarse-grain and fine-grain parallelism are supported
If OpenMP support is disabled when compiling a program, the program executes sequentially
[Figure: the fork-join execution model: the master thread forks a team of threads at each parallel region and joins them when the region ends]
Conditional parallelization
if (scalar expression)
Degree of concurrency
num_threads(integer expression): # of threads to create
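A minimal sketch combining these clauses; the size threshold and the variable n are illustrative, not from the original:

#include <omp.h>
#include <stdio.h>

int main() {
    int n = 2000;   /* hypothetical problem size */

    /* fork threads only if the work is large enough; request 4 threads */
    #pragma omp parallel if (n > 1000) num_threads(4)
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    return 0;
}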
Data Scoping
private (variable list)
  specifies variables local to each thread
firstprivate (variable list)
  similar to private; private variables are initialized to the variable's value before the parallel directive
shared (variable list)
  specifies that variables are shared across all the threads
default (shared | none)
  default data scoping specifier; may be shared or none
Example: a parallel directive with the clauses private(a) shared(b) firstprivate(c)
  each thread shares a single copy of variable b
  each thread gets private copies of variables a and c
  each private copy of c is initialized with the value of c in the main thread when the parallel directive is encountered
default(none)
default state of a variable is specified as none (rather than shared); the compiler signals an error if not all variables used in the region are specified as shared or private
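A minimal sketch of the scoping clauses described above (the variables a, b, c follow the example):

#include <omp.h>
#include <stdio.h>

int main() {
    int a = 1, b = 2, c = 3;

    /* a: fresh, uninitialized private copy per thread
       b: single copy shared by all threads
       c: private copy per thread, initialized to 3 */
    #pragma omp parallel default(none) private(a) shared(b) firstprivate(c)
    {
        a = omp_get_thread_num();   /* must write a before reading it */
        printf("a=%d b=%d c=%d\n", a, b, c);
    }
    return 0;
}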
[Figure: an OpenMP parallel region shown side by side with its Pthreads equivalent]
Specifying Worksharing
Within the scope of a parallel directive, worksharing directives allow concurrency between iterations or tasks
Usage:
#pragma omp for [clause list] /* for loop */
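A minimal sketch of the for worksharing directive (the arrays and loop body are illustrative):

#include <stdio.h>

#define N 1000

int main() {
    double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = (double)i;

    /* iterations of the loop are divided among the threads of the team */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }
    printf("a[10] = %f\n", a[10]);
    return 0;
}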
Example (Monte Carlo estimation of pi): here, the user manually divides the work among threads:

num_threads = omp_get_num_threads();
sample_points_per_thread = npoints / num_threads;
sum = 0;
for (i = 0; i < sample_points_per_thread; i++) {
    coord_x = (double)(rand_r(&seed))/(double)((2<<14)-1) - 0.5;
    coord_y = (double)(rand_r(&seed))/(double)((2<<14)-1) - 0.5;
    if ((coord_x * coord_x + coord_y * coord_y) < 0.25)
        sum++;
}

With a reduction(+: sum) clause instead: a local copy of sum is created for each thread; at the end, all local copies of sum are added together and stored in the master thread's copy
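A hedged sketch of the same computation using the reduction clause (assumes npoints divides evenly among threads and RAND_MAX is 32767, as in the code above):

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    int npoints = 1000000;
    int sum = 0;

    /* each thread accumulates into a private copy of sum;
       the copies are combined with + at the end of the region */
    #pragma omp parallel reduction(+: sum)
    {
        unsigned int seed = 42 + omp_get_thread_num();
        int per_thread = npoints / omp_get_num_threads();
        for (int i = 0; i < per_thread; i++) {
            double x = (double)rand_r(&seed)/(double)((2<<14)-1) - 0.5;
            double y = (double)rand_r(&seed)/(double)((2<<14)-1) - 0.5;
            if (x * x + y * y < 0.25)
                sum++;
        }
    }
    printf("pi estimate: %f\n", 4.0 * sum / npoints);
    return 0;
}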
Recipe for mapping iterations to threads
Usage: schedule(scheduling_class[, parameter])
Four scheduling classes:
static: work partitioned at compile time; iterations statically divided into pieces of size chunk and statically assigned to threads
dynamic: iterations divided into pieces of size chunk; chunks are dynamically scheduled among the threads; when a thread finishes one chunk, it is dynamically assigned another; default chunk size is 1
guided: chunk size is exponentially reduced with each dispatched piece of work; the default minimum chunk size is 1
runtime: scheduling decision is deferred until run time and taken from the environment variable OMP_SCHEDULE; it is illegal to specify a chunk size for this clause
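A minimal sketch of the schedule clause (dynamic scheduling with an arbitrary chunk size of 100):

#include <math.h>
#include <stdio.h>

int main() {
    double total = 0.0;

    /* chunks of 100 iterations are handed to threads on demand,
       which helps when iteration costs vary */
    #pragma omp parallel for schedule(dynamic, 100) reduction(+: total)
    for (int i = 0; i < 10000; i++)
        total += sqrt((double)i);

    printf("total = %f\n", total);
    return 0;
}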
nowait clause
modifies a for directive; avoids the implicit barrier at the end of the for
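A sketch of nowait; it is safe here only because the two loops touch different arrays (an assumption of this example):

#include <stdio.h>

#define N 1000

int main() {
    double a[N], b[N];

    #pragma omp parallel
    {
        /* no implicit barrier at the end of this loop ... */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = (double)i;

        /* ... so a thread may start this loop before all threads
           finish the previous one */
        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * i;
    }
    printf("%f %f\n", a[N-1], b[N-1]);
    return 0;
}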
Usage:
#pragma omp sections [clause list]
{
  [#pragma omp section
    /* structured block */ ]
  [#pragma omp section
    /* structured block */ ]
  ...
}
(the brackets indicate that each section directive is optional; they are not part of the syntax)
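A minimal sketch of sections; taskA and taskB are hypothetical independent routines:

#include <stdio.h>

void taskA(void) { printf("A\n"); }   /* hypothetical work */
void taskB(void) { printf("B\n"); }   /* hypothetical work */

int main() {
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            taskA();   /* may execute on one thread ... */

            #pragma omp section
            taskB();   /* ... concurrently with this on another */
        }
    }
    return 0;
}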
single-threaded execution
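A hedged sketch of single-threaded execution inside a parallel region, assuming this refers to the single directive:

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        /* exactly one thread executes this block; the others wait
           at the implicit barrier at the end of the single construct */
        #pragma omp single
        printf("initialized by thread %d\n", omp_get_thread_num());

        printf("thread %d continues\n", omp_get_thread_num());
    }
    return 0;
}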
Orphaned Directives
Execution rules
a worksharing construct that appears outside the lexical extent of any parallel region is called orphaned; an orphaned worksharing construct is executed serially when not called from within a parallel region
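A hedged sketch of an orphaned worksharing construct (the routine scale and the array a are illustrative):

#include <stdio.h>

#define N 100

double a[N];

/* the for directive below is orphaned: it appears outside the
   lexical extent of any parallel construct */
void scale(double s) {
    #pragma omp for
    for (int i = 0; i < N; i++)
        a[i] *= s;
}

int main() {
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    #pragma omp parallel
    scale(2.0);   /* iterations are shared among the team's threads */

    scale(3.0);   /* no enclosing parallel region: runs serially */

    printf("a[1] = %f\n", a[1]);
    return 0;
}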
What is a task?
work unit
execution can begin immediately, or be deferred
components of a task
code to execute, data environment, internal control variables
Task execution
data environment is constructed at creation
tasks are executed by the threads of a team
a task can be tied to a thread (i.e., migration/stealing not allowed)
by default, a task is tied to the first thread that executes it
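A minimal sketch of task creation and execution by a team (the loop bound of 8 is arbitrary):

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        /* one thread creates the tasks; any thread of the team may run them */
        #pragma omp single
        for (int i = 0; i < 8; i++) {
            #pragma omp task firstprivate(i)
            printf("task %d executed by thread %d\n",
                   i, omp_get_thread_num());
        }
    }   /* implicit barrier: all tasks finish before the region ends */
    return 0;
}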
Conditional parallelization
if (scalar expression)
determines whether the construct creates a task
Binding to threads
untied
Data scoping
private (variable list)
  specifies variables local to the child task
firstprivate (variable list)
  similar to private; private variables are initialized to their value in the parent task before the directive
shared (variable list)
  specifies that variables are shared with the parent task
default (shared | none)
  default data handling specifier; may be shared or none
Example: a task construct placed directly inside a parallel region creates one task (here invoking x) per thread in the region; all of the x tasks complete at the barrier at the end of the region
Static and global variables are shared
Automatic (local) variables are private
Variables for orphaned tasks are firstprivate by default
Variables for non-orphaned tasks inherit the shared attribute
task variables are firstprivate unless shared in the enclosing context
Example: when child tasks compute into variables x and y that the parent later reads, x and y need the shared attribute (the default would be firstprivate, so the children would update private copies); a taskwait suspends the parent task until its children finish
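A hedged sketch of this pattern using a Fibonacci-style recursion (not necessarily the original example):

#include <stdio.h>

int fib(int n) {
    if (n < 2) return n;
    int x, y;

    /* x and y must be shared: the child tasks write them and the
       parent reads them after the taskwait; the default here
       would be firstprivate */
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);

    /* suspend the parent task until both children finish */
    #pragma omp taskwait
    return x + y;
}

int main() {
    int result;
    #pragma omp parallel
    #pragma omp single
    result = fib(10);
    printf("fib(10) = %d\n", result);
    return 0;
}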
List Traversal
Element *first, *e;
#pragma omp parallel
#pragma omp single
{
  for (e = first; e; e = e->next)
    #pragma omp task firstprivate(e)
    process(e);
}
Is the use of variables safe as written?
Task Scheduling
Tied tasks
only the thread that the task is tied to may execute it
a task can only be suspended at a suspend point:
  task creation, task finish, taskwait, barrier
if a task is not suspended at a barrier, it can only switch to a descendant of any task tied to the thread
Untied tasks
no scheduling restrictions
can suspend at any point; can switch to any task
Performance: repeatedly forking and joining a separate parallel region for each loop is slower; a single parallel region enclosing multiple worksharing constructs is faster, e.g.:

!$OMP PARALLEL
....
!$OMP DO
   statement 1
   statement 2
!$OMP END DO
....
!$OMP END PARALLEL
Processor count
int omp_get_num_procs();   /* number of processors currently available */
int omp_in_parallel();     /* determine whether executing in a parallel region */
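A minimal sketch querying the runtime (output depends on the machine):

#include <omp.h>
#include <stdio.h>

int main() {
    printf("processors available: %d\n", omp_get_num_procs());
    printf("in parallel? %d\n", omp_in_parallel());      /* 0 here */

    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        printf("in parallel? %d\n", omp_in_parallel());  /* 1 here */
    }
    return 0;
}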
Mutual exclusion
void omp_init_lock(omp_lock_t *lock);
void omp_destroy_lock(omp_lock_t *lock);
void omp_set_lock(omp_lock_t *lock);
void omp_unset_lock(omp_lock_t *lock);
int omp_test_lock(omp_lock_t *lock);
Each lock routine has a nested-lock counterpart (omp_*_nest_lock) for recursive mutexes
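A minimal sketch of the lock routines guarding a shared counter (a critical directive would work equally well here):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);

    #pragma omp parallel
    {
        omp_set_lock(&lock);     /* enter the critical section */
        counter++;               /* protected update of shared state */
        omp_unset_lock(&lock);   /* leave the critical section */
    }

    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);
    return 0;
}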
OMP_DYNAMIC
specifies whether the number of threads can be dynamically changed
OMP_NESTED
enables nested parallelism (may be nominal: one thread)
OMP_SCHEDULE
specifies scheduling of for-loops if the clause specifies runtime
OpenMP 3.0
OMP_MAX_ACTIVE_LEVELS
  integer value for the maximum number of nested parallel regions
OMP_THREAD_LIMIT
  maximum number of threads for the entire program
Directive advantages
directives facilitate a variety of thread-related tasks and free the programmer from:
  initializing attribute objects
  setting up thread arguments
  partitioning iteration spaces
Directive disadvantages
data exchange is less apparent
leads to mysterious overheads: data movement, false sharing, and contention
References
Blaise Barney. LLNL OpenMP tutorial. http://www.llnl.gov/computing/tutorials/openMP
Adapted from slides "Programming Shared Address Space Platforms" by Ananth Grama.
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing, Chapter 7. Addison Wesley, 2003.
Sun Microsystems. OpenMP API User's Guide, Chapter 7: Performance Considerations. http://docs.sun.com/source/819-3694/7_tuning.html
Alberto Duran. OpenMP 3.0: What's New? IWOMP 2008. http://cobweb.ecn.purdue.edu/ParaMount/iwomp2008/documents/omp30
Stephen Blair-Chappell. Expressing Parallelism Using the Intel Compiler. http://www.polyhedron.com/web_images/documents/Expressing%20Parallelism%20Using%20Intel%20Compiler.pdf
Rusty Lusk et al. Programming Models and Runtime Systems. Exascale Software Center Meeting, ANL, Jan. 2011.