Lecture: OpenMP
Programming with OpenMP
CS240A, T. Yang
A Programmer’s View of OpenMP
• What is OpenMP?
• Open specification for Multi-Processing
• “Standard” API for defining multi-threaded shared-memory
programs
• openmp.org – Talks, examples, forums, etc.
• OpenMP is a portable, threaded, shared-memory
programming specification with “light” syntax
• Exact behavior depends on OpenMP implementation!
• Requires compiler support (C or Fortran)
• OpenMP will:
• Allow a programmer to separate a program into serial regions and
parallel regions, rather than reason about T concurrently-executing threads.
• Hide stack management
• Provide synchronization constructs
• OpenMP will not:
• Parallelize automatically
• Guarantee speedup
• Provide freedom from data races
Motivation – OpenMP
int main() {
    // serial program: all work runs on one thread
    return 0;
}
Motivation – OpenMP
#include <omp.h>
int main() {
    omp_set_num_threads(4);   // request 4 threads for later parallel regions
    return 0;
}
OpenMP parallel region construct
• Block of code to be executed by multiple threads in
parallel
• Each thread executes the same code redundantly
(SPMD)
• Work within work-sharing constructs is distributed among the
threads in a team
• Example with C/C++ syntax
#pragma omp parallel [ clause [ clause ] ... ] new-line
structured-block
• clause can include the following:
private (list)
shared (list)
• Example: OpenMP variables are shared by default
To make a variable private, declare it in the pragma:
#pragma omp parallel private (x)
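A minimal sketch of the private clause in use (the variable x and the printed values are illustrative, not from the slide):

#include <stdio.h>
#include <omp.h>

int main() {
    double x = 0.0;                        // shared by default
    #pragma omp parallel private(x)
    {
        // each thread gets its own (uninitialized) copy of x
        x = 0.5 * omp_get_thread_num();
        printf("thread %d: x = %f\n", omp_get_thread_num(), x);
    }                                      // implicit barrier; the original x is unchanged
    return 0;
}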
OpenMP Programming Model - Review
• Fork - Join Model:
[Figure: sequential code runs on one thread, forks into a team (Thread 0, Thread 1, ...) for the parallel region, then joins back to sequential code.]
parallel Pragma and Scope – More Examples
• Example 1: every thread in the team redundantly executes the same statements
  Thread 0: x = 1; y = 1 + x;
  Thread 1: x = 1; y = 1 + x;   (the slide also shows the equivalent form y = x + 1)
• Example 2: each thread uses its own id to do different work
  Thread 0: id = 0; x[0] = 0; x[4] = 0;
  Thread 1: id = 1; x[1] = 0; x[5] = 0;
  Thread 2: id = 2; x[2] = 0; x[6] = 0;
  Thread 3: id = 3; x[3] = 0; x[7] = 0;
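A sketch of code that could produce Example 2's behavior (the array x and the num_threads(4) clause are assumptions based on the per-thread assignments above):

#include <omp.h>

double x[8];

void zero_x(void) {
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();   // private: 0..3
        x[id]     = 0.0;                 // thread id zeroes x[id] ...
        x[id + 4] = 0.0;                 // ... and x[id + 4]
    }
}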
Use pragma parallel for
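The slide's code is not reproduced here; a minimal sketch of the parallel-for form (the function and array names are illustrative):

void scale(double *a, const double *b, int n) {
    // the runtime divides the iterations among the threads in the team;
    // the loop index i is private automatically
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}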
OpenMP Data Parallel Construct: Parallel Loop
• Compiler calculates loop bounds for each thread directly
from serial source (computation decomposition)
• Compiler also manages data partitioning
• Synchronization is also automatic (an implicit barrier at the end of the loop)
Programming Model – Parallel Loops
• Requirement for parallel loops
• No data dependencies (read/write or write/write pairs)
between iterations! (See the sketch below.)
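For contrast, a hypothetical pair of loops (the arrays a, b, c and the bound n are illustrative): the first has no dependences between iterations, while the second carries a read/write dependence and must stay serial.

void examples(double *a, const double *b, const double *c, int n) {
    // safe: iteration i writes only a[i] and reads only b[i], c[i]
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    // NOT safe to parallelize: iteration i reads a[i-1], which
    // iteration i-1 writes (a loop-carried dependence)
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];
}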
Sequential Calculation of π in C
#include <stdio.h>
static long num_steps = 100000;
double step;
void main () {
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = sum / num_steps;
    printf("pi = %6.12f\n", pi);
}
OpenMP Reduction
• A first parallel version (SPMD): each thread accumulates a private partial sum
in sum[id]; the partial sums are combined serially afterwards. The reduction
clause (next slide) automates this pattern.

#include <omp.h>
#include <stdio.h>
#define NUM_THREADS 4           /* thread count used by this example */
static long num_steps = 100000;
double step;
void main () {
    int i; double x, pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private ( i, x )
    {
        int id = omp_get_thread_num();
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x*x);
        }
    }
    for (i = 1; i < NUM_THREADS; i++)
        sum[0] += sum[i];
    pi = sum[0] / num_steps;
    printf("pi = %6.12f\n", pi);
}
Version 2: parallel for, reduction
#include <omp.h>
#include <stdio.h>
static long num_steps = 100000;
double step;
void main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=1; i<= num_steps; i++){
x = (i-0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = sum / num_steps;
printf ("pi = %6.8f\n", pi);
}
Loop Scheduling in Parallel for pragma
Impact of Scheduling Decision
• Load balance
• Same work in each iteration?
• Processors working at same speed?
• Scheduling overhead
• Static decisions are cheap because they require no run-time
coordination
• Dynamic decisions have overhead that is impacted by
complexity and frequency of decisions
• Data locality
• Particularly within cache lines for small chunk sizes
• Also impacts data reuse on same processor
OpenMP environment variables
OMP_NUM_THREADS
§ sets the number of threads to use during execution
§ when dynamic adjustment of the number of threads is enabled, the
value of this environment variable is the maximum number of
threads to use
§ For example,
setenv OMP_NUM_THREADS 16 [csh, tcsh]
export OMP_NUM_THREADS=16 [sh, ksh, bash]
OMP_SCHEDULE
§ applies only to do/for and parallel do/for directives that
have the schedule type RUNTIME
§ sets schedule type and chunk size for all such loops
§ For example,
setenv OMP_SCHEDULE GUIDED,4 [csh, tcsh]
export OMP_SCHEDULE="GUIDED,4" [sh, ksh, bash]
Programming Model – Loop Scheduling
• The schedule clause determines how loop iterations are
divided among the thread team (see the sketch after this list)
• static([chunk]) divides iterations statically between
threads
• Each thread receives [chunk] iterations, rounding as necessary
to account for all iterations
• Default [chunk] is ceil( # iterations / # threads )
• dynamic([chunk]) allocates [chunk] iterations per thread,
allocating an additional [chunk] iterations when a thread
finishes
• Forms a logical work queue, consisting of all loop iterations
• Default [chunk] is 1
• guided([chunk]) allocates dynamically, but [chunk] is
exponentially reduced with each allocation
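A sketch of the three schedule kinds applied to the same loop (the chunk size 4 and the helper work() are illustrative):

void work(int i);   // assumed: some per-iteration computation

void run(int n) {
    #pragma omp parallel for schedule(static)      // fixed, contiguous blocks of iterations
    for (int i = 0; i < n; i++) work(i);

    #pragma omp parallel for schedule(dynamic, 4)  // threads grab 4 iterations at a time
    for (int i = 0; i < n; i++) work(i);

    #pragma omp parallel for schedule(guided, 4)   // chunk sizes shrink, but stay >= 4
    for (int i = 0; i < n; i++) work(i);
}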
Loop scheduling options
Programming Model – Data Sharing
• Parallel programs often employ two types of data
  • Shared data, visible to all threads, similarly named
  • Private data, visible to a single thread (often stack-allocated)
• PThreads:
  • Global-scoped variables are shared
  • Stack-allocated variables are private
• OpenMP:
  • shared variables are shared
  • private variables are private

// shared, globals
int bigdata[1024];

void* foo(void* bar) {
    // private, stack
    int tid;

    #pragma omp parallel \
        shared ( bigdata ) \
        private ( tid )
    {
        /* Calculation goes here */
    }
}
Programming Model - Synchronization
• OpenMP Synchronization
  • OpenMP Critical Sections
    • Named or unnamed
    • No explicit locks / mutexes
  • Barrier directives

#pragma omp critical
{
    /* Critical code here */
}

#pragma omp barrier
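A sketch of a barrier separating two phases of work inside one parallel region (the arrays and the per-phase computations are illustrative):

#include <omp.h>

void two_phase(double *a, double *b, int n) {
    #pragma omp parallel
    {
        int id  = omp_get_thread_num();
        int nth = omp_get_num_threads();

        // phase 1: each thread fills its share of a[]
        for (int i = id; i < n; i += nth)
            a[i] = i * 0.5;

        #pragma omp barrier   // all of a[] is written before any thread reads it

        // phase 2: each element of b[] may read any element of a[]
        for (int i = id; i < n; i += nth)
            b[i] = a[(i + 1) % n];
    }
}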
Omp critical vs. atomic
int sum = 0;
#pragma omp parallel for
for (int j = 1; j < n; j++) {
    int x = j * j;
    #pragma omp critical
    {
        sum = sum + x;   // only one thread enters the critical section at a time
    }
}

• May also use #pragma omp atomic with a single update statement, e.g. x += expr
• Faster, but supports only a limited set of update operations such as
++, --, +=, -=, *=, /=, &=, |=
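For comparison, a sketch of the same accumulation using atomic in place of critical (wrapped in a hypothetical sum_of_squares() function so it is self-contained):

int sum_of_squares(int n) {
    int sum = 0;
    #pragma omp parallel for
    for (int j = 1; j < n; j++) {
        int x = j * j;
        #pragma omp atomic
        sum += x;        // a single atomic update replaces the critical section
    }
    return sum;
}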
OpenMP Timing
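The slide's timing code is not shown; a minimal sketch using the standard omp_get_wtime() routine (the timed loop is illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main() {
    static double a[N];
    double t0 = omp_get_wtime();     // wall-clock time in seconds

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    double t1 = omp_get_wtime();
    printf("elapsed = %f s (a[N-1] = %f)\n", t1 - t0, a[N-1]);
    return 0;
}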
Parallel Matrix Multiply: Run Tasks Ti in parallel on multiple threads
[Figure: the multiply is decomposed into tasks T1 and T2, which execute concurrently on different threads.]
Matrix Multiply in OpenMP
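The slide's code is not reproduced; a common sketch (matrix names, the size N, and the row-wise decomposition are assumptions) that parallelizes the outer loop over rows of C:

#define N 512

double A[N][N], B[N][N], C[N][N];

void matmul(void) {
    // each thread computes whole rows of C; i is private automatically,
    // and j, k, tmp are declared inside the loop so they are private too
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double tmp = 0.0;
            for (int k = 0; k < N; k++)
                tmp += A[i][k] * B[k][j];
            C[i][j] = tmp;
        }
}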
OpenMP Summary
• OpenMP is a compiler-based technique to create
concurrent code from (mostly) serial code
• OpenMP can enable (easy) parallelization of loop-based
code with fork-join parallelism
• #pragma omp parallel
• #pragma omp parallel for
• #pragma omp parallel private ( i, x )
• #pragma omp atomic
• #pragma omp critical
• #pragma omp for reduction(+ : sum)