Lab # 2 by Akram
Objectives:
1- Know the primary components of OpenMP API.
2- Know how to parallelize portions of your application.
OpenMP:
OpenMP is an acronym for Open Multi-Processing. It is an Application
Programming Interface (API) for developing multithreaded parallel
programs on shared-memory architectures, jointly defined by a group of
major computer hardware and software vendors. OpenMP provides a
portable, scalable model for developers of shared-memory parallel
applications. The API supports C/C++ and Fortran on a wide variety of
architectures.
The OpenMP API comprises three distinct components:
1- Compiler Directives
2- Runtime Library Routines
3- Environment Variables
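As a small illustration, the following sketch touches all three components (the team size printed depends on the environment):

#include <omp.h>
#include <stdio.h>

int main() {
    // 1- compiler directive: marks the block below for parallel execution
    #pragma omp parallel
    {
        // 2- runtime library routine: queries the size of the thread team
        printf("team size: %d\n", omp_get_num_threads());
    }
    // 3- environment variable: e.g. OMP_NUM_THREADS=4 controls the team size
    return 0;
}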
OpenMP is a library for parallel programming in the SMP (symmetric
multi-processors, or shared-memory processors) model. When
programming with OpenMP, all threads share memory and data.
OpenMP supports C, C++, and Fortran. The OpenMP functions are
declared in a header file called omp.h.
OpenMP program structure: An OpenMP program has sections that
are sequential and sections that are parallel. In general, an OpenMP
program starts with a sequential section in which it sets up the
environment, initializes the variables, and so on.
When run, an OpenMP program will use one thread (in the sequential
sections), and several threads (in the parallel sections).
There is one thread that runs from the beginning to the end, and it's
called the master thread. The parallel sections of the program will
cause additional threads to fork. These are called the slave threads.
A section of code that is to be executed in parallel is marked by a
special directive (omp pragma). When the execution reaches a parallel
section (marked by the omp pragma), the directive causes slave
threads to fork. Each thread executes the parallel section of the code
independently. When a thread finishes, it joins the master. When all
threads finish, the master continues with code following the parallel
section.
Each thread has an ID attached to it that can be obtained using a
runtime library function (called omp_get_thread_num()). The
ID of the master thread is 0.
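For example, the following sketch prints each thread's ID (the exact output order varies from run to run):

#include <omp.h>
#include <stdio.h>

int main() {
    // each thread in the team prints its own ID; the master thread is 0
    #pragma omp parallel
    printf("Hello from thread %d\n", omp_get_thread_num());
    return 0;
}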
Why OpenMP? Writing lower-level parallel code by hand can be more
efficient; however, OpenMP hides the low-level details and allows the
programmer to describe the parallel code with high-level constructs,
which is about as simple as it can get.
About threads:
Any modern operating system makes it possible to create threads. If we
create, for example, 4 threads, and we run the code on a computer with 4
CPU cores, we can usually count on the operating system to do the sensible
thing: it will typically assign each thread to a separate CPU core, and hence,
if all goes well, we can do (almost) 4 times as much useful work per second
as previously.
There are many ways of creating threads. For example, Unix-like operating
systems have a low-level interface called pthreads. The C++ programming
language introduced the thread support library in C++11; this provides
a slightly higher-level interface on top of the threading primitives
provided by the operating system.
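For comparison, a minimal pthreads sketch might look like this (the work function is an assumed placeholder):

#include <pthread.h>
#include <stdio.h>

// with pthreads, every thread is created and joined by hand
void* work(void* arg) {
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main() {
    pthread_t threads[4];
    for (long i = 0; i < 4; ++i)
        pthread_create(&threads[i], NULL, work, (void*)i);
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}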
Code written with OpenMP looks a bit strange at first, but once you get used
to it, it is very convenient and quick to use. The basic idea is that you as a
programmer add #pragma omp directives in the source code, and these
directives specify how OpenMP is supposed to divide work among multiple
threads.
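For example, a directive placed in front of a loop divides its iterations among the threads (the array-doubling loop below is just an illustrative sketch):

#include <omp.h>

void double_all(float* a, int n) {
    // OpenMP splits the n iterations among the available threads
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0f * a[i];
    }
}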
OpenMP will take care of creating threads for you. It will maintain a pool of
threads that is readily available whenever needed. It will also take care of
synchronization: in the example above, the program does not continue
until all threads have completed their part of the parallel for loop.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 16
int main() {
    int a = 0;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel reduction(+: a) // clause inferred from the text below
    a += 1; // each thread adds 1 to its private copy of a
    printf("a = %d\n", a); // prints a = 16
    return 0;
}
As the printed output shows, the values of the private copies of a from
all threads are added together and stored in the global copy of a held
by the master thread.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 4
int main() {
    int a = 2;
    int b = 1;
    int c = 7;
    omp_set_num_threads(NUM_THREADS);
    // the reduction clauses below are inferred from the explanation that follows
    #pragma omp parallel reduction(+: a) reduction(*: b)
    {
        a = 1; // assigns a value to the private copy of a
        b = 2; // assigns a value to the private copy of b
        #pragma omp master
        c = 23; // implemented only once, by the master thread
    }
    printf("a = %d, b = %d, c = %d\n", a, b, c); // prints a = 6, b = 16, c = 23
    return 0;
}
Each thread has a private copy of a set to 1; these are added to the copy of
the master thread (a = 2). Therefore, the value of a outside the parallel
region in the master thread becomes 6. The assignment c = 23 inside the
parallel region is performed only by the master thread; since c is shared,
its value outside the parallel region becomes 23. Each thread also has a
private copy of b set to 2; all of these are multiplied together and then
multiplied by the latest value of b before entering the parallel region.
Therefore, the value of b after performing the parallel region equals
(2*2*2*2) * 1 = 16.
For example, parallelizing the following loop is safe if you can safely
execute the operations c(0), c(1), …, c(9) simultaneously in parallel:
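A sketch of such a loop (assuming c is some function whose calls are independent of one another):

#include <omp.h>

void c(int i); // assumed: an operation that touches no shared data

void run_all(void) {
    // the calls c(0), ..., c(9) may execute in different threads
    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
}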
But exactly when can we execute two operations X and Y simultaneously in
parallel? We will discuss this in more detail in later labs, but for now
the following rules of thumb are enough:
1- There must not be any shared data element that is read by X and
written by Y.
2- There must not be any shared data element that is written by X and
written by Y.
Here a “data element” is, for example, a scalar variable or an array element.
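For instance, the following loop would be unsafe to parallelize as written, because every iteration writes the same shared variable (a hypothetical example; a reduction clause like the one above would make it safe):

#include <omp.h>

int sum_all(const int* a, int n) {
    int s = 0;
    // unsafe: every iteration writes the shared variable s,
    // violating rule 2- above (written by X and written by Y)
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}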
But parallelizing the following code is perfectly fine, assuming that x
and y are pointers to distinct array elements and that the function f
does not have any side effects:
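A sketch of such code (assuming f is a pure function, and x and y never refer to the same element):

#include <omp.h>

float f(float v); // assumed: a function with no side effects

void apply(float* x, const float* y, int n) {
    // safe: iteration i writes only x[i] and reads only y[i]
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        x[i] = f(y[i]);
    }
}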
An example of safely parallelizing program code using OpenMP:
#include <algorithm>
#include <limits>

void step(float* r, const float* d, int n) {
    // divide the iterations of the outermost loop among the threads
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float v = std::numeric_limits<float>::infinity();
            for (int k = 0; k < n; ++k) {
                float x = d[n*i + k];
                float y = d[n*k + j];
                float z = x + y;
                v = std::min(v, z); // keep the smallest sum found so far
            }
            r[n*i + j] = v;
        }
    }
}
This is really everything that we need to do! We can compile this with
the -fopenmp flag, run it, and it will make use of all CPU cores. For
example, when n = 4000, using a computer with 4 cores, each thread will
run 1000 iterations of the outermost for loop.
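With GCC, for instance, the compilation could look like this (the file name step.cpp is just an assumed example):

g++ -O2 -fopenmp step.cpp -o step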