
OpenMP Tutorial

Seung-Jai Min
(smin@purdue.edu)

School of Electrical and Computer Engineering


Purdue University, West Lafayette, IN

ECE 563 Programming Parallel Machines 1


Parallel Programming Standards

• Thread Libraries
- Win32 API / Posix threads
• Compiler Directives (our focus)
- OpenMP (Shared memory programming)
• Message Passing Libraries
- MPI (Distributed memory programming)



Shared Memory Parallel
Programming in the Multi-Core Era

• Desktop and Laptop


– 2, 4, 8 cores and … ?
• A single node in distributed memory clusters
– Steele cluster node: 2 × 8 (16) cores
• Shared-memory hardware accelerators
– Cell processors: 1 PPE and 8 SPEs
– Nvidia Quadro GPUs: 128 processing units



OpenMP:
Some syntax details to get us started
• Most of the constructs in OpenMP are compiler
directives or pragmas.
– For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
– For Fortran, the directives take one of the forms:
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
• Include files
#include "omp.h"



How is OpenMP typically used?
• OpenMP is usually used to parallelize loops:
• Find your most time consuming loops.
• Split them up between threads.
Sequential Program:

void main()
{
  int i, k, N=1000;
  double A[N], B[N], C[N];
  for (i=0; i<N; i++) {
    A[i] = B[i] + k*C[i];
  }
}

Parallel Program:

#include "omp.h"
void main()
{
  int i, k, N=1000;
  double A[N], B[N], C[N];
  #pragma omp parallel for
  for (i=0; i<N; i++) {
    A[i] = B[i] + k*C[i];
  }
}



How is OpenMP typically used?
(Cont.)
• Single Program Multiple Data (SPMD)
Parallel Program

#include "omp.h"
void main()
{
int i, k, N=1000;
double A[N], B[N], C[N];
#pragma omp parallel for
for (i=0; i<N; i++) {
A[i] = B[i] + k*C[i];
}
}



How is OpenMP typically used?
(Cont.)
• Single Program Multiple Data (SPMD)
Thread 0:

void main()
{
  int i, k, N=1000;
  double A[N], B[N], C[N];
  lb = 0;
  ub = 250;
  for (i=lb; i<ub; i++) {
    A[i] = B[i] + k*C[i];
  }
}

Threads 1, 2, and 3 run the same code with their own loop bounds:
Thread 1: lb = 250; ub = 500;
Thread 2: lb = 500; ub = 750;
Thread 3: lb = 750; ub = 1000;



OpenMP Fork-and-Join model

printf("program begin\n");            Serial
N = 1000;

#pragma omp parallel for
for (i=0; i<N; i++)                   Parallel
  A[i] = B[i] + C[i];

M = 500;                              Serial
#pragma omp parallel for
for (j=0; j<M; j++)                   Parallel
  p[j] = q[j] - r[j];

printf("program done\n");             Serial



OpenMP Constructs
1. Parallel Regions
– #pragma omp parallel
2. Worksharing
– #pragma omp for, #pragma omp sections
3. Data Environment
– #pragma omp parallel shared/private (…)
4. Synchronization
– #pragma omp barrier
5. Runtime functions/environment variables
– int my_thread_id = omp_get_thread_num();
– omp_set_num_threads(8);



OpenMP: Structured blocks

• Most OpenMP constructs apply to structured blocks.
– Structured block: one point of entry at the top and
one point of exit at the bottom.
– The only “branches” allowed are STOP statements in
Fortran and exit() in C/C++.



OpenMP: Structured blocks

A Structured Block:

#pragma omp parallel
{
more:  do_big_job(id);
       if(++count>1) goto more;
}
printf(" All done \n");

Not A Structured Block:

if(count==1) goto more;
#pragma omp parallel
{
more:  do_big_job(id);
       if(++count>1) goto done;
}
done:  if(!really_done()) goto more;



Structured Block Boundaries
• In C/C++: a block is a single statement or a group of
statements between brackets {}

#pragma omp parallel
{
  id = omp_get_thread_num();
  A[id] = big_compute(id);
}

#pragma omp for
for (I=0; I<N; I++) {
  res[I] = big_calc(I);
  A[I] = B[I] + res[I];
}



Structured Block Boundaries
• In Fortran: a block is a single statement or a group of
statements between directive/end-directive pairs.

C$OMP PARALLEL
10    W(id) = garbage(id)
      res(id) = W(id)**2
      if(res(id) .ne. 0) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do I=1,N
        res(I) = bigComp(I)
      end do
C$OMP END PARALLEL DO



OpenMP Parallel Regions
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");

A single copy of "A" is shared between all threads: the four
threads execute pooh(0,A), pooh(1,A), pooh(2,A), and pooh(3,A)
concurrently. The closing brace is an implicit barrier: threads
wait there for all threads to finish before proceeding.
The OpenMP API
Combined parallel work-share
• OpenMP shortcut: Put the “parallel” and the
work-share on the same line
int i;
double res[MAX];
#pragma omp parallel
{
  #pragma omp for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }
}

is the same as:

int i;
double res[MAX];
#pragma omp parallel for
for (i=0; i<MAX; i++) {
  res[i] = huge();
}
Shared Memory Model
[Diagram: threads 1-5, each with its own private data, all
attached to a common shared memory]

• Data can be shared or private
• Shared data is accessible by all threads
• Private data can be accessed only by the thread that owns it
• Data transfer is transparent to the programmer
Data Environment:
Default storage attributes

• Shared memory programming model
  – Variables are shared by default
• Distributed memory programming model
  – All variables are private



Data Environment:
Default storage attributes

• Global variables are SHARED among threads
  – Fortran: COMMON blocks, SAVE variables, MODULE variables
  – C: file-scope variables, static variables

• But not everything is shared...
  – Stack variables in sub-programs called from parallel regions
    are PRIVATE
  – Automatic variables within a statement block are PRIVATE



Data Environment
int A[100][100];    /* (Global) SHARED */

int foo(int x)
{
  int count=0;      /* PRIVATE (stack variable in called function) */
  return x*count;
}

int main()
{
  int ii, jj;       /* PRIVATE */
  int B[100][100];  /* SHARED */
  #pragma omp parallel private(jj)
  {
    int kk = 1;     /* PRIVATE */
    #pragma omp for
    for (ii=0; ii<N; ii++)
      for (jj=0; jj<N; jj++)
        A[ii][jj] = foo(B[ii][jj]);
  }
}
Work Sharing Construct
Loop Construct
#pragma omp for [clause[[,] clause …] new-line
for-loops

Where clause is one of the following:

private / firstprivate / lastprivate(list)
reduction(operator: list)
schedule(kind[, chunk_size])
collapse(n)
ordered
nowait



Schedule
for (i=0; i<1100; i++)
A[i] = … ;

#pragma omp parallel for schedule (static, 250)
  chunks:  250 250 250 250 100
  threads: p0  p1  p2  p3  p0

#pragma omp parallel for schedule (static)
  chunks:  275 275 275 275
  threads: p0  p1  p2  p3

#pragma omp parallel for schedule (dynamic, 200)
  chunks:  200 200 200 200 200 100
  threads: p3  p0  p2  p3  p1  p0   (first come, first served)

#pragma omp parallel for schedule (guided, 100)
  chunks:  137 120 105 100 100 100 100 100 100 100 38
  threads: p0  p3  p0  p1  p2  p3  p0  p1  p2  p3  p0

#pragma omp parallel for schedule (auto)


Critical Construct
sum = 0;
#pragma omp parallel private (lsum)
{
  lsum = 0;
  #pragma omp for
  for (i=0; i<N; i++) {
    lsum = lsum + A[i];
  }
  #pragma omp critical
  { sum += lsum; }
}

Threads wait their turn; only one thread at a time
executes the critical section.



Reduction Clause
sum is a shared variable:

sum = 0;
#pragma omp parallel for reduction (+:sum)
for (i=0; i<N; i++)
{
sum = sum + A[i];
}



Performance Evaluation
• How do we measure performance? (or
how do we remove noise?)
#define N 24000
for (k=0; k<10; k++)
{
#pragma omp parallel for private(i, j)
for (i=1; i<N-1; i++)
for (j=1; j<N-1; j++)
a[i][j] = (b[i][j-1]+b[i][j+1])/2.0;
}



Performance Issues

• What if you see a speedup saturation?

[Plot: speedup vs. number of CPUs (1, 2, 4, 6, 8),
flattening as CPUs are added]

#define N 12000
#pragma omp parallel for private(j)
for (i=1; i<N-1; i++)
  for (j=1; j<N-1; j++)
    a[i][j] = (b[i][j-1]+b[i][j]+b[i][j+1]+
               b[i-1][j]+b[i+1][j])/5.0;



Performance Issues
• What if you see a speedup saturation?

[Plot: speedup vs. number of CPUs (1, 2, 4, 6, 8),
flattening as CPUs are added]

#define N 12000
#pragma omp parallel for private(j)
for (i=1; i<N-1; i++)
for (j=1; j<N-1; j++)
a[i][j] = b[i][j];



Loop Scheduling

• Any guideline for a chunk size?

#define N <big-number>

chunk = ???;
#pragma omp parallel for schedule (static, chunk)
for (i=2; i<N-2; i++)
  a[i] = ( b[i-2] + b[i-1] + b[i] +
           b[i+1] + b[i+2] )/5.0;



Performance Issues

• Load imbalance: triangular access pattern

#define N 12000
#pragma omp parallel for private(j)
for (i=1; i<N-1; i++)
for (j=i; j<N-1; j++)
a[i][j] = (b[i][j-1]+b[i][j]+b[i][j+1]+
           b[i-1][j]+b[i+1][j])/5.0;



Summary
• OpenMP has advantages
– Incremental parallelization
– Compared to MPI
• No data partitioning
• No communication scheduling



Resources

http://www.openmp.org
http://openmp.org/wp/resources
