
Lab # 2: Introduction to OpenMP (Part 1)

Objectives:
1- Know the primary components of the OpenMP API.
2- Know how to parallelize portions of your application.

OpenMP:
 OpenMP is an acronym for Open Multi-Processing.
 OpenMP is an Application Programming Interface (API) for
developing multithreaded parallel programs on shared-memory
architectures.
 The OpenMP API is jointly defined by a group of major computer
hardware and software vendors. OpenMP provides a portable, scalable
model for developers of shared-memory parallel applications. The API
supports C/C++ and Fortran on a wide variety of architectures.
 The OpenMP API comprises three distinct components:

1- Compiler Directives

2- Runtime Library Routines

3- Environment Variables
 OpenMP is a library for parallel programming in the SMP (symmetric
multi-processor, or shared-memory processor) model. When
programming with OpenMP, all threads share memory and data.
OpenMP supports C, C++, and Fortran. The OpenMP runtime functions
are declared in a header file called omp.h.
 OpenMP program structure: An OpenMP program has sections that
are sequential and sections that are parallel. In general, an OpenMP
program starts with a sequential section in which it sets up the
environment, initializes variables, and so on.
 When run, an OpenMP program uses one thread in the sequential
sections and several threads in the parallel sections.
 There is one thread that runs from the beginning to the end, and it's
called the master thread. The parallel sections of the program will
cause additional threads to fork. These are called the slave threads.
 A section of code that is to be executed in parallel is marked by a
special directive (an omp pragma). When execution reaches a parallel
section, this directive causes slave threads to fork. Each thread
executes the parallel section of the code independently. When a thread
finishes, it joins the master. When all threads finish, the master
continues with the code following the parallel section.
 Each thread has an ID attached to it that can be obtained using a
runtime library function called omp_get_thread_num(). The ID of the
master thread is 0 (see the sketch after this list).
 Why OpenMP? More efficient, lower-level parallel code is possible;
however, OpenMP hides the low-level details and allows the
programmer to describe the parallel code with high-level constructs,
which is about as simple as it can get.
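
To make these components concrete, here is a minimal sketch (not taken from the lab exercises) that uses a compiler directive to fork a team of threads, the runtime library routine omp_get_thread_num() to query each thread's ID, and that can be influenced through the OMP_NUM_THREADS environment variable:

#include <omp.h>
#include <stdio.h>

int main() {
    // Sequential section: only the master thread (ID 0) runs here.
    printf("Before the parallel region: one thread\n");

    // Compiler directive: fork a team of threads.
    // The team size can be set with the OMP_NUM_THREADS
    // environment variable, e.g. OMP_NUM_THREADS=4 ./a.out
    #pragma omp parallel
    {
        // Runtime library routine: each thread gets its own ID.
        int id = omp_get_thread_num();
        printf("Hello from thread %d\n", id);
    }   // Implicit join: all slave threads finish here.

    // Sequential section again: back to the master thread only.
    printf("After the parallel region: one thread\n");
    return 0;
}

With 4 threads, the middle printf appears four times, once per thread, in an unpredictable order; the first and last lines are printed exactly once by the master thread.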
About threads:
Any modern operating system makes it possible to create threads. If we
create, for example, 4 threads, and we run the code on a computer with 4
CPU cores, we can usually count on the operating system to do the sensible
thing: it will typically assign each thread to a separate CPU core, and hence,
if all goes well, we can do (almost) 4 times as much useful work per second
as previously.

There are many ways of creating threads. For example, Unix-like operating
systems have a low-level interface called pthreads. The C++ programming
language introduced the thread support library in C++11; it provides a
somewhat higher-level interface on top of the threading primitives
provided by the operating system.
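
To give a feel for what using such an interface directly involves, here is a rough sketch (the function work and the chunk sizes are purely illustrative) of dividing a loop among four C++11 std::thread objects by hand; the chunk arithmetic and the explicit join calls are exactly the bookkeeping that OpenMP will handle for us later:

#include <thread>
#include <vector>

void work(int begin, int end) {
    // ... process elements in the range [begin, end) ...
}

int main() {
    const int n = 1000;
    const int num_threads = 4;
    std::vector<std::thread> threads;
    // Split the range [0, n) into num_threads contiguous chunks by hand.
    for (int t = 0; t < num_threads; ++t) {
        int begin = t * n / num_threads;
        int end = (t + 1) * n / num_threads;
        threads.emplace_back(work, begin, end);
    }
    // Wait for every thread to finish before continuing.
    for (auto& th : threads) {
        th.join();
    }
    return 0;
}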

However, using such libraries directly to speed up computations in our
application would take a nontrivial amount of code. In this lab, we will use a
convenient high-level interface called OpenMP.
 OpenMP: multithreading made easy
 OpenMP is an extension of the C, C++, and Fortran programming
languages. It is standardized and widely supported. For example, the
GCC compiler has built-in support for it. To enable OpenMP support,
we just need to add the command-line switch -fopenmp both when
compiling the code and when linking the code, as shown below.
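
For example, assuming GCC/g++ and a source file named main.cpp (the file name is only for illustration), the compilation command might change like this:

# Without OpenMP:
g++ -O3 main.cpp -o main

# With OpenMP support enabled (compile and link):
g++ -O3 -fopenmp main.cpp -o main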

Code written with OpenMP looks a bit strange at first, but once you get used
to it, it is very convenient and quick to use. The basic idea is that you as a
programmer add #pragma omp directives in the source code, and these
directives specify how OpenMP is supposed to divide work among multiple
threads.

OpenMP parallel for loops:


#pragma omp parallel for is a directive that can be placed before a for loop
that you want to parallelize. Here is a very simple example:
Without the #pragma, the code would be executed sequentially: a single
thread runs c(0), then c(1), and so on, up to c(9).

However, the #pragma omp parallel for directive instructs the compiler to
generate code that splits the iterations of the loop among multiple threads.
For example, if we have 4 CPU cores, OpenMP typically uses 4 threads, and
each thread executes its own share of the iterations, all running at the
same time.

OpenMP will take care of creating threads for you. It will maintain a pool of
threads that is readily available whenever needed. It will also take care of
synchronization: for example, in the above example the program does not
continue until all threads have completed their part of the parallel for loop.
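
As a small sketch of this implicit synchronization (the array and its size are only illustrative), the code after the parallel loop below can safely assume that every iteration has already finished:

#include <stdio.h>

int main() {
    int result[10];

    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        result[i] = i * i;   // each iteration writes its own element
    }
    // Implicit barrier: execution only reaches this point after
    // all threads have completed their share of the loop.
    int sum = 0;
    for (int i = 0; i < 10; ++i) {
        sum += result[i];
    }
    printf("sum = %d\n", sum);
    return 0;
}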
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 16

int main() {
    int a = 0;

    // Each of the 16 threads gets a private copy of a and sets it to 1.
    // At the end of the region the private copies are summed (reduction
    // with +) into the shared a.
    #pragma omp parallel reduction(+:a) num_threads(NUM_THREADS)
    a = 1; // Assigns a value to the private copy.

    printf("The value of a in the master thread %d\n", a);
    return 0;
}
When this program is run, the private copies of a from all threads are added
together and combined with the original shared value, so after the parallel
region the master thread sees a = 0 + 16 × 1 = 16.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4

int main() {
    int a = 2;
    int b = 1;
    int c = 7;

    #pragma omp parallel reduction(+:a) reduction(*:b) num_threads(NUM_THREADS)
    {
        a = 1; // Assigns a value to the private copy of a.
        b = 2; // Assigns a value to the private copy of b.

        // This is executed only once, by the master thread, so only
        // the shared c is updated; its value becomes 23.
        #pragma omp master
        c = 23;
    }

    printf("Values of a, b, and c in master thread are %d, %d, %d respectively.\n",
           a, b, c);
    return 0;
}
Each thread has a private copy of a set to 1; at the end of the region these are
added to the value the master thread had before the region (a = 2). Therefore,
the value of a outside the parallel region becomes 2 + 4 × 1 = 6. The assignment
c = 23 inside the parallel region is performed only by the master thread, so the
value of c outside the parallel region is 23. Each thread also has a private copy
of b set to 2; all of these are multiplied together and then multiplied by the
value of b before entering the parallel region. Therefore, the value of b after
the parallel region equals (2 × 2 × 2 × 2) × 1 = 16.

Warning! Stay safe!


Whenever you ask OpenMP to parallelize a for loop, it is your
responsibility to make sure it is safe.

For example, parallelizing the loop shown earlier is safe if you can safely
execute the operations c(0), c(1), …, c(9) simultaneously in parallel.
But exactly when can we execute two operations X and Y simultaneously in
parallel? We will discuss this in more detail in later labs, but for now
the following rules of thumb are enough:

 There must not be any shared data element that is read by X and
written by Y.
 There must not be any shared data element that is written by X and
written by Y.

Here a “data element” is, for example, a scalar variable or an array element.

For example, the following loop cannot be parallelized; iteration 0 writes
to x[1] and iteration 1 reads from the same element:
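
A sketch of a loop with such a dependency (the arrays are only illustrative): iteration i writes x[i+1], which iteration i+1 then reads:

float x[11] = {0};
float y[10] = {0};
for (int i = 0; i < 10; ++i) {
    x[i+1] = x[i] + y[i];   // iteration 0 writes x[1]; iteration 1 reads x[1]
}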

The following loop cannot be parallelized either; iteration 0 writes to y[0]
and iteration 1 writes to the same element:
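
A sketch of a loop with such a write-write conflict (again, the arrays are only illustrative): every iteration writes to the same element y[0]:

float x[10] = {0};
float y[1] = {0};
for (int i = 0; i < 10; ++i) {
    y[0] += x[i];   // every iteration writes to the same element y[0]
}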

But parallelizing the following code is perfectly fine, assuming x and y are
pointers to distinct array elements and function f does not have any side
effects.
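
For instance, a loop of this shape (a sketch; x, y, and f are exactly the pointers and the side-effect-free function assumed above):

for (int i = 0; i < 10; ++i) {
    y[i] = f(x[i]);   // each iteration reads only x[i] and writes only y[i]
}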
An example of safely parallelizing program code using OpenMP:
#include <algorithm>
#include <limits>

void step(float* r, const float* d, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float v = std::numeric_limits<float>::infinity();
            for (int k = 0; k < n; ++k) {
                float x = d[n*i + k];
                float y = d[n*k + j];
                float z = x + y;
                v = std::min(v, z);
            }
            r[n*i + j] = v;
        }
    }
}

Here it is perfectly safe to parallelize the outermost for loop:

 Variable j is local (introduced inside the loop), not shared.
 Variable n is read-only.
 Array d is read-only.
 Array r is write-only.
 Different iterations of the loop write to different elements of array r;
the same element is never written by two different iterations.

Hence parallelizing the loop is as easy as adding a single pragma in the
right place:
#include <algorithm>
#include <limits>

void step(float* r, const float* d, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float v = std::numeric_limits<float>::infinity();
            for (int k = 0; k < n; ++k) {
                float x = d[n*i + k];
                float y = d[n*k + j];
                float z = x + y;
                v = std::min(v, z);
            }
            r[n*i + j] = v;
        }
    }
}

This is really everything that we need to do! We can compile this with the
-fopenmp flag, run it, and it will make use of all CPU cores. For example,
when n = 4000, using a computer with 4 cores, each thread will run 1000
iterations of the outermost for loop.
