09 ParallelizationRecap PDF
09 ParallelizationRecap PDF
09 ParallelizationRecap PDF
October 9, 2018
1 09 Parallelization Recap
In [1]: from IPython.display import Image
import re
import numpy as np
import matplotlib.pyplot as plt
In [2]: %alias clean rm -f *.c *.exe *.py *.pyc *.s .*f95 *.o *.fo *log *.h *.cc *.mod *.ppm *.p
In [3]: %clean
export PATH=/usr/local/gcc-7.3.0/bin:$PATH
export COMPILER_PATH=/usr/local/gcc-7.3.0/bin
export LIBRARY_PATH=/usr/local/gcc-7.3.0/lib:/usr/local/gcc-7.3.0/lib64:$LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/gcc-7.3.0/lib:/usr/local/gcc-7.3.0/lib64:$LD_LIBRARY_PATH
export CPATH=/usr/local/gcc-7.3.0/include
export C_INCLUDE_PATH=/usr/local/gcc-7.3.0/include
export CPLUS_INCLUDE_PATH=/usr/local/gcc-7.3.0/include
export OBJC_INCLUDE_PATH=/usr/local/gcc-7.3.0/include
export CC=/usr/local/gcc-7.3.0/bin/gcc
export CXX=/usr/local/gcc-7.3.0/bin/g++
export FC=/usr/local/gcc-7.3.0/bin/gfortran
export CPP=/usr/local/gcc-7.3.0/bin/cpp
export MANPATH=/usr/local/gcc-7.3.0/share/man:${MANPATH}
1.1 Goals
1. To have a brief review of what we have discussed
2. To have a llok at few additional topics
1
1.2 Outline
Section ??
Section ??
Section ??
Section ??
In [5]: Image("pictures/1200px-Moore's_Law_Transistor_Count_1971-2016.png")
Out[5]:
In [6]: Image("pictures/power2.png")
2
Out[6]:
In [7]: Image("pictures/more_cores.png")
Out[7]:
3
hence to get full advantage of current CPU you need to know a little bit of parallelism
CPUs are based on a relatively simple paradigm,
In [8]: Image("pictures/von_neumann.png")
Out[8]:
4
but a single core is a complex object:
In [9]: Image("pictures/onecore.png")
Out[9]:
5
Instructions and data must be continuously fed to the control and arithmetic units, so that
the speed of the memory interface poses a limitation on compute performance (von Neumann
bottleneck).
The architecture is inherently sequential, processing a single instruction with (possibly) a sin-
gle operand or a group of operands from memory.
And we have several layers of optimization:
In [10]: Image("pictures/perfo_dim.png")
Out[10]:
6
• Sharing of compute load with a communication protocol:
– (Gigabit) Ethernet
– High-speed fabric (e. g. Section ?? or Section ??).
– Each node may be itself a NUMA archicture with separated multicore sockets.
In [11]: Image("pictures/networks.png")
Out[11]:
7
2.0.2 Memory layout
Computer memory is organized in a hierarchy of decreasing price and speed and increasing ca-
pacity. Caches are small and fast and main memory is big and slow.
So chaches need to map memory several memory locations (as a juggler does with balls).
Cache divided into cache lines: when memory is read into cache, a whole cache line is always
read at the same time.
Good if we have data locality: nearby memory accesses will be fast.
Typical cache line size: 64-128 bytes
Programs often re-use nearby memory locations: - Temporal locality: probably we will need
again soon data from a location - Spatial locality: nearby addresses likely to be needed soon
Example: reuse nearby elements in dense matrix multiplication (one of the past examples)
How cheap are FLOPs?
Rough estimate for one socket of the Section ??
Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz with avx512:
• 14 cores
• 8 vector lanes per core
• 14 cores * 2.6 GHz * 2 FMA * 32 IPC * 8 vector lanes ~ 1165 GFLOP/s
• 1165 GFLOP/s * 8 bytes ~ 9.3 TFLOP/s
8
Maximum RAM bandwidth~ 0.12 TB/s. Hence 9.3/0.12 ~ 77.5 FLOPS per memory access.
Therefore, on this platform to have a compute bound application you must achieve approximately
78 FLOPS x access.
Strip mining and Cache Blocking
Suppose we have the following code:
at each step we use a b[j] then throw it away and load the next. For n large with respect to
cache lines size, performance is determined by memory bandwidth.
now suppose we partially unroll manually the inner loop:
where TILE is the cache line size. This is called strip mining (too much jargon) and, among
other things may help the compiler to vectorize code. But we have still a memory bound snippet.
what happens if the loop are permuted?
we load TILE elements from b[j] and then use it for all computations, thereby increasing
temporal locality. This is called also cache blocking with block of size TILE.
In [12]: Image("pictures/cache_block.png")
Out[12]:
9
Compiler optimization
Once we have set up a decent piece of code we can ask the copiler to help a little bit:
-finline [enabled]
-finline-atomics [enabled]
-finline-functions-called-once [enabled]
Arrays of Structures vs Structures of arrays Suppose you to write a code simulating the the
behaviour of many particle system, under some physical model, that is integrating its equation of
motion. You need data structures to hold basic information such positions, velocities, etc.
One way could be:
then would be easy to form arrays of particles particle System[nparticles] and operate on
them. Adding new features to particles would involve changing the structure and some functions
for specific tasks, with (possibly) little modifications.
Using C++, classes and OOP would make it even more portable.
However what happens when run into a loop like that:
locality is not guaranteed and is very difficult for the compiler to optimise aggresively (and
prevent vectorization).
Alternatively:
10
#define maxp 1e6;
typedef ParticleSystem
{
float Positions = malloc(maxp * 3 * sizeof(float);
...
} System;
...
this is a common wisdom for performance oriented development and is called to create struc-
tures of arrays instead of arrays fo structures (Section ??).
11
Load imbalance
The slowest process determines when everyone is done. Time waiting for other processes to finish
is time wasted.
Communication overhead
A cost only incurred by the parallel program. Grows with the number of processes for collective
communication.
In [14]: Image("pictures/parallel.png")
Out[14]:
Amdhal Law
Tser Tser 1−α
Tpar = = Tser ∗ f s + f p ∗ = (α + ) ∗ Tser
p p p
Tser 1
S= =
Tpar α + 1−p α
Weak Scaling
S( P) → S( P, N )
Tser 1
S( P, N ) = =
Tpar α + 1−p α
N→∞⇒α→0
S( P, N )α→0 = P
12
3 1 - Vectorization
• Perform one operation on multiple elements of a vector
• 2000+ SSE instruction sets (several versions of them, from SSE1 to SEE4.2): 128-bit registers.
13
_<vector_size/instruction set>_<intrin_op>_<prec_suffix>
• is mm for 128 bit vectors (SSE), mm256 for 256 bit vectors (AVX and AVX2), and mm512 for
AVX512.
• Declares the operation of the intrinsic function. I.e. add, sub, mul ...
• Indicates the datatype. ps is for float, pd for double, and ep is for integer datatypes: epi32
for signed 32 bit integer, epu16 for unsigned 16 bit integer ...
For instance a vector intrisic multiplication for double vectors with AVX:
_mm256_mul_pd(&a , &a)
_mm256_sqrt_pf(&a).
1. Random numbers
2. Integer division
float y[8] = {10., 20., 30., 40., 50. ,60., 70., 80.};
14
__m128 Four = _mm_load_ps(y);
for(int i=0; i < 2; i++)
for(int v=0; v < 4; v++)
printf("%d %d %f %f \n",v,i*4+v,y[i*4+v],Four[i*4+v]);
return 0;
}
Writing access_128.c
In [17]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -msse4.2 -o access.exe access_128.c
./access.exe
rm -f access.exe
0 0 10.000000 10.000000
1 1 20.000000 20.000000
2 2 30.000000 30.000000
3 3 40.000000 40.000000
0 4 50.000000 10.000000
1 5 60.000000 20.000000
2 6 70.000000 30.000000
3 7 80.000000 40.000000
15
const int steps = 1e7;
double *arr = calloc(sizeof(double), steps);
printf("non vectorized\n");
double t0 = omp_get_wtime();
for(int i = 0; i < steps; i++)
arr[i] = (double ) i;
double time = 1000.*(omp_get_wtime() - t0);
printf("time=%f\n", time);
printf("intrisics\n");
int dim = 2; //SSE
t0 = omp_get_wtime();
for(int i = 0; i < steps/dim; i += dim)
{
__m128d tmp = {(double ) i+1,(double ) i};
_mm_storeu_pd(&arr[i], tmp);
}
if( steps % dim != 0)
for(int i = (steps/dim)*dim; i < steps; i++)
arr[i] = (double ) i;
time = 1000.*(omp_get_wtime() - t0);
printf("time=%f\n", time);
printf("OpenMP\n");
t0 = omp_get_wtime();
#pragma omp simd
for(int i=0; i < steps; i++)
arr[i] = (double ) i;;
time = 1000.*(omp_get_wtime() - t0);
printf("time=%f\n", time);
return 0;
}
Writing intr_vs_pragma.c
In [19]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -msse4.2 -fopenmp -o intr_vs_pragma.exe intr_
echo $OMP_NUM_THREADS
./intr_vs_pragma.exe
rm -f intr_vs_pragma.exe
non vectorized
time=21.300893
16
intrisics
time=4.626742
OpenMP
time=9.601765
In [20]: Image("pictures/gears.png")
Out[20]:
** Solution 1.0 **
double I = 0.;
const double dx = (end-start) / ((double ) steps);
double t0 = omp_get_wtime();
for(int i=0; i < steps; i++)
{
double mid_x = start + dx * ((double ) i + 0.5);
double mid_y = 1.0/(1. + mid_x*mid_x);
I += mid_y;
}
I = 4. * I * dx;
double time = 1000.* (omp_get_wtime() - t0);
printf("PI=%f, evaluated in %f ms\n",I, time);
17
I = 0.;
t0 = omp_get_wtime();
printf("Evalute PI in vectorized loop\n");
__m128d num = _mm_set_pd(4.,4.);
__m128d IV = _mm_set_pd(0.,0.);
for(int i = 0; i < steps; i = i+2)
{
__m128d x = _mm_set_pd(i*dx, (i+1)*dx);
IV =_mm_add_pd(IV,_mm_div_pd(num,1+_mm_mul_pd(x,x)));
}
for(int i = 0; i < 2; i++)
I += IV[i];
I = I * dx;
time = 1000.* (omp_get_wtime() - t0);
printf("PI=%f, evaluated in %f ms\n",I, time);
return 0;
}
Writing integral_intrisic.c
In [22]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -msse4.2 -fopenmp -o integral_intrisic.exe in
./integral_intrisic.exe
4 2 - OpenMP
Shared memory computer : any computer composed of multiple processing elements that share
an address space.
Two Classes: - Symmetric multiprocessor (SMP): a shared address space with “equal-time”
access for each processor, and the OS treats every processor the same way. (Real SMP have not
existed for many years, see slides of class#1)
• Non Uniform address space multiprocessor (NUMA;you have one of these): different mem-
ory regions have different access costs ... think of memory segmented into “Near” and “Far”
memory.
Process - An instance of a program execution. - The execution context of a running program ...
i.e. the resources associated with a program’s execution.
In [23]: Image("pictures/process.png")
18
Out[23]:
Thread:
In [24]: Image("pictures/Thread.png")
Out[24]:
19
** A shared memory program**
20
– Change how data is accessed to minimize the need for synchronization.
OpenMP i. e. Open Multi Processing an Application Program Interface (API) for developing
multithreaded applications.
OpenMP core syntax
Most of the constructs in OpenMP are compiler directives. #pragma omp construct [clause
[clause]...]
Example #pragma omp parallel num_threads(4)
Function prototypes and types in the file: #include
Most OpenMP constructs apply to a structured block: - Structured block: a block of one or more
statements with one point of entry at the top and one point of exit at the bottom.
Parallel regions are created with OpenMP whenever you invoke #pragma omp parallel. From
that point until the end of the region (structured block closed or !$OMP end parallel) a team of
threads is spawned and the code is executed indepently by each thread.
In [25]: Image("pictures/fork_join.jpeg")
Out[25]:
• Unintended sharing of data causes race conditions in which the program’s results depend
on thread scheduling.
• To control race conditions: use synchronization to protect data conflicts.
• Synchronization is expensive so: change how data is accessed to minimize the need for
synchronization.
• Synchronization may use barriers (all threads in a team) or mutual exclusions (mutexes) (a
pair of threads)
• Loop construct
• Single construct
• Master construct
• Sections construct
• Task construct
Sections construct
21
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
// structured block 1
}
#pragma omp section
{
// structured block 2
}
#pragma omp section
{
// structured block 3
}
...
}
}
sections starts the construct. It contains several section constructs which marks a different
block, containing a represents a task. (Beware of section and sections).
sections distributes the blocks/tasks between existing threads. The requirement is that each
block must be independent of the other blocks. Then each thread executes one block at a time.
Each block is executed only once by one thread.
Environment Variables
• OMP_DISPLAY_ENV
22
#include <omp.h>
//g. mancini july 18
int main(int argc, char **argv)
{
double I = 0.;
const double dx = (end-start) / ((double ) num_steps);
double t0 = omp_get_wtime();
std::cout << "Loop completed in "<< 1000.*(t1-t0) << " ms using " << std::endl;
std::cout << "Integral value = " << I << std::endl;
}
Writing integral_omp_red.cc
In [27]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
g++ -Wall -O2 -fopenmp -march=native -mtune=native -o integral.exe integral_omp_red.cc
OMP_NUM_THREADS=1 ./integral.exe
Loop completed in 690.51 ms using
Integral value = 3.14159
23
In [30]: plt.plot((1,2,4,8),result,marker='s',color='k',ls='-')
plt.ylabel('Wall time')
plt.xlabel('number of threads')
z2n+1 = z2n + c
where c is a constant and z0 = 0.
Points that do not diverge after a finite number of iterations are part of the set.
//settings
const int maxiter = 500;
double horiz = 2.0;
const int width = 1024;
const int height = 1024;
const double xmin = -2.;
const double xmax = 1.;
const double ymin = -1.2;
const double ymax = 1.2;
const char* filename = "out.ppm";
//variables
int k;
horiz = horiz*horiz;
double x0, y0, xres, yres;
xres = (xmax - xmin)/((double) width);
yres = (ymax - ymin)/((double) height);
double re, im, re2, im2;
24
FILE * fp = fopen(filename,"w");
fprintf(fp,"P3\n%d %d %d\n",width, height, 255);
//compute map for every pixel
//write black & white out file
}
Writing template.c
In [32]: Image("pictures/gears.png")
Out[32]:
** Solution 2.0 **
void write_output(char* filename, int* image, int width, int height, int maxiter)
25
{
26
}
y0 = y0 + yres;
}// close on i
printf("Spent %f seconds\n",omp_get_wtime()-t0);
write_output( filename, image, width, height, maxiter);
free(image);
}
Writing mandelbrot_serial.c
In [34]: %%bash
rm -f out.*
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -fopenmp -o mandelbrot.exe mandelbrot_serial.
./mandelbrot.exe
#https://www.imagemagick.org/script/index.php
convert out.ppm out.png
In [35]: Image("out.png")
Out[35]:
27
In [36]: %%timeit
%%bash
./mandelbrot.exe
28
OpenMP version
void write_output(char* filename, int* image, int width, int height, int maxiter)
{
// open output PPM file, https://en.wikipedia.org/wiki/Netpbm_format#PPM_example
FILE * fp = fopen(filename,"w");
fprintf(fp,"P3\n%d %d %d\n",width, height, 255);
29
{
//settings
const int maxiter = 500;
const int width = 2048;
const int height = 2048;
const double xmin = -2.;
const double xmax = 1.;
const double ymin = -1.2;
const double ymax = 1.2;
char* filename = "out.ppm";
double horiz = 3.0;
//variables
horiz = horiz*horiz;
double xres, yres;
xres = (xmax - xmin)/((double) width);
yres = (ymax - ymin)/((double) height);
//allocate image
int *image = malloc(sizeof(int) * height * width);
// run
#ifdef _OPENMP
double t0 = omp_get_wtime();
int block = 8;
#pragma omp parallel for schedule(dynamic) //collapse(2)
for(int jj = 0; jj < width; jj+=block)
for (int i = 0; i < height; i++)
{
for (int j = jj; j < jj + block; j++)
{
double x0 = xmin + j*xres;
double y0 = ymin + i*yres;
image[i + j*height] = mandel(x0,y0,horiz,maxiter);
}
}// close on i
double t1 = omp_get_wtime();
printf("spent %f ms in parallel loop\n",(t1-t0)*1000.0);
#else
double y0 = ymin;
for (int i = 0; i< height; i++)
{
double x0 = xmin;
for (int j = 0; j < width; j++)
{
image[i + j*height] = mandel(x0,y0,horiz,maxiter);
x0 = x0 + xres;
30
}
y0 = y0 + yres;
}// close on i
#endif
write_output( filename, image, width, height, maxiter);
free(image);
return 0;
}
Writing mandelbrot_omp.c
In [39]: %%bash
rm -f out.*
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -fopenmp -o mandelbrot.exe mandelbrot_omp.c -
OMP_NUM_THREADS=2 ./mandelbrot.exe
convert out.ppm out.png
In [40]: Image("out.png")
Out[40]:
31
In [41]: %%capture thr1
%%bash
OMP_NUM_THREADS=1 ./mandelbrot.exe
In [42]: thr1.stdout
In [43]: %%bash
OMP_NUM_THREADS=2 ./mandelbrot.exe
32
In [44]: %%bash
OMP_NUM_THREADS=4 ./mandelbrot.exe
In [46]: thr8.stdout
Final Speedup
In [47]: t1 = float(thr1.stdout.split()[1])
t8 = float(thr8.stdout.split()[1])
print("Final speedup ",t1/t8)
output depends on number of threads and execution sequence. The word order will be mixed.
33
}
printf("\n");
return 0;
}
this solves the problem but all threads but one are wasted. Using the task construct:
Writing sanrossore.c
In [49]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -fopenmp -o race.exe sanrossore.c
OMP_NUM_THREADS=1 ./race.exe
A race horse
In [50]: %%bash
gcc -Wall -O2 -fopenmp -o race.exe sanrossore.c
OMP_NUM_THREADS=2 ./race.exe
A race horse
Tasks are independent units of work. Threads are assigned to perform the work of each task.
Tasks are composed of:
- code to execute
34
- data environment (it own its data)
- internal control variables
Task barrier (taskwait): encountering thread suspends until all child tasks it has generated are
completed.
Tasks can be executed in arbitrary order and generated until there is some work to do:
In [51]: Image("pictures/tasks.png")
Out[51]:
int main()
{
#pragma omp parallel
{
#pragma omp single
{
printf("A ");
35
#pragma omp task
printf("race ");
#pragma omp task
printf("car ");
#pragma omp taskwait
printf(" or a horse one?");
} //single thread
} //parallel region
printf("\n");
return 0;
}
Overwriting sanrossore.c
In [53]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -fopenmp -o race.exe sanrossore.c
OMP_NUM_THREADS=2 ./race.exe
Flowchart:
• FindingConcurrency
– data vs. control
• AlgorithmStructure
– pipeline, replicate...
• SupportingStructure
– SPMD, fork/join...
36
• Implementation
– barriers, locks...
Common patterns:
• Divide phase:
– Breaks down problem into two or more sub-problems of the same (or related) type.
• Conquer phase:
– Executes the computations on each of the “indivisible” sub-problems.
– May also combine solutions of sub-problems until the solution of the original problem
is reached.
• Because the nature of recursion forms smaller sub-problems that are very much like the
larger problem being solved.
• The return from recursive calls can be used to combine partial solutions into an overall so-
lution.
In [54]: Image("pictures/dac.jpg")
Out[54]:
37
Fibonacci sequence with tasks
In [55]: Image("pictures/Quicksort.png")
Out[55]:
38
In [56]: %%writefile quicksort.c
#include <stdlib.h>
#include <stdio.h>
//https://rosettacode.org/wiki/Sorting_algorithms/Quicksort#C
temp = A[i];
A[i] = A[j];
A[j] = temp;
39
}
//left branch
quicksort(A, i);
//right branch
quicksort(A + i, len - i);
}
Writing quicksort.c
In [57]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -msse4.2 -o QS.exe quicksort.c
./QS.exe
Sorted array:
10 7 8 9 1 5 7 4
Unsorted array:
1 4 5 7 7 8 9 10
40
printf("Generating %d random integers in[%ld,%ld]\n",length,gsl_rng_min(R),gsl_rng_max(R));
for( int i = 0; i < length; i++)
Arr[i] = gsl_rng_uniform_int(R, length);
gsl_rng_free(R)
In [58]: Image("pictures/gears.png")
Out[58]:
temp = A[i];
41
A[i] = A[j];
A[j] = temp;
}
//left branch
#pragma omp task
quicksort(A, i);
//right branch
#pragma omp task
quicksort(A + i, len - i);
}
#ifdef VRB
printf("Unsorted array:\n");
printA(A, length);
#endif
double t0 = omp_get_wtime();
int nt;
#pragma omp parallel
{
nt = omp_get_num_threads();
{
#pragma omp single
quicksort(A, length);
}
}
42
printf("Sorted array:\n");
printA(A, length);
#endif
return 0;
}
Writing quicksort_omp.c
In [60]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -msse4.2 -o QS.exe quicksort_omp.c \
-fopenmp -lm -lgsl -lgslcblas -I/usr/include/gsl/ -DVRB
OMP_NUM_THREADS=1 ./QS.exe 20
In [61]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
OMP_NUM_THREADS=4 ./QS.exe 20
In [62]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -msse4.2 -o QS.exe quicksort_omp.c \
-fopenmp -lm -lgsl -lgslcblas -I/usr/include/gsl/
In [63]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
OMP_NUM_THREADS=1 ./QS.exe 20000000
In [64]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
OMP_NUM_THREADS=2 ./QS.exe 20000000
43
Generating 20000000 random integers in[0,20000000]
Sorted in 1.557683 with 2 threads
In [65]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
OMP_NUM_THREADS=4 ./QS.exe 20000000
In [66]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
OMP_NUM_THREADS=8 ./QS.exe 20000000
5 3 - MPI
MPI (Message Passing Interface) is standard specification for message passing.
• Portable.
MPI features
When an MPI program is run, multiple processes are executed and work of separate blocks of
data.
The collection of processes involved in a computationis called a process group.
Data is shared either with one-to-one:
44
• MPI_Send(),
• MPI_Recv()
or collective communications:
• MPI_Barrier()
• MPI_Gather()
• MPI_Scatter()
In [67]: Image("pictures/mpi_funcs.png")
Out[67]:
45
{
double sum = 0.;
for(int i = 0; i < dim; i++)
{
double sum_i = b[i];
for(int j = 0; j < dim; j++)
sum_i -= A[i][j] * xnew[j];
sum += pow(sum_i,2);
}
return sqrt(sum/(double ) dim);
}
void gen_jac_input(int DIM, int high, double **A, double *b, double *xo)
{
for (int i=0; i < DIM;i++)
{
b[i] = (double ) (rand() % high);
xo[i] = (double ) (rand() % high);
double sum = 0.;
46
{
47
double t0 = 0., t1 = 0.;
srand(23);
//allocate
double **A = malloc(DIM * sizeof(double*));
A[0] = malloc(DIM * DIM * sizeof(double));
for(int i = 1; i < DIM; i++)
A[i] = A[0] + i*DIM;
double *b = malloc(DIM * sizeof(double));
double *xo = malloc(DIM * sizeof(double));
double *xn = calloc(DIM, sizeof(double));
int rank = 0;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
//call jacobi
t0 = MPI_Wtime();
MPI_Finalize();
t1 = MPI_Wtime();
if(rank == 0)
{
if(niter < kmax)
printf("converged at %g after %d iterations in %g s\n",conv,niter,t1-t0);
else
printf("NOT converged (conv=%g) after %d iterations\n",conv,niter);
//check results
double error = check_jacobi(DIM, A, b, xn);
printf("error=%f\n",error);
}
}
Writing jacobi.c
In [69]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
mpicc -O2 -Wall -march=native -mtune=native -msse4.2 -lm -fopenmp -lmpi -o jacobi.exe j
mpirun -n 1 jacobi.exe 4000 100
48
converged at 8.85065e-09 after 45 iterations in 0.559049 s
error=0.294429
In [70]: %%bash
mpirun -n 2 jacobi.exe 4000 100
In [71]: %%bash
mpirun -n 4 jacobi.exe 4000 100
int MPI_Irecv(void *buffer, int count, MPI_Datatype datatype, int source, int tag,
MPI_Communicator comm, MPI_Request *request);
MPI_Request takes the role of MPI_Status in the blocking Recv but now is needed also by the
send call.
Completition of the request is made with MPI_Test:
49
In [72]: %%writefile ring_nb.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
//g. mancini july 18
int main(int argc, char **argv)
{
int rank, size, check, to, from;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
snprintf(msg,sizeof(msg),"%d%s%d%s%d",rank,hdr,to,ftr,from);
if (size > 1)
{
MPI_Isend(&msg[0], 35, MPI_BYTE, to, check, MPI_COMM_WORLD, &ReqS);
MPI_Irecv(rmsg,35, MPI_BYTE, from, from, MPI_COMM_WORLD, &ReqR);
MPI_Wait(&ReqS, &Stat);
MPI_Wait(&ReqR, &Stat);
printf("%s \n",rmsg);
}
MPI_Finalize();
50
}
Writing ring_nb.c
In [73]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
mpicc -O2 -Wall -o ring.exe ring_nb.c -march=native -mtune=native -lm -lmpi
mpirun -n 8 ring.exe
In [74]: Image("pictures/geom_decomp.png")
Out[74]:
51
Example: 1D heat equation
In [75]: Image("pictures/rod.png")
Out[75]:
52
Use first order and second order finite difference for time and space, respectively:
∂T T n +1 − T n
=
∂t δt
∂ T
2 Tj+1 − 2. · Tj + Tj−1
=
∂x 2 δx
then combine:
MPI_Init(0,NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Status status;
if (size > 1)
{
if(Ncell % size != 0)
Ncell = Ncell + Ncell % size;
53
chunk = Ncell / size + 2;
first = 2;
last = chunk-2;
Temp = calloc(chunk, sizeof(double));
Tnew = calloc(chunk, sizeof(double));
if(rank == 0)
Temp[1] = H0;
if(rank == size-1)
Temp[chunk-2] = H1;
}
else
{
first = 1;
last = Ncell - 1;
chunk = Ncell;
if(rank == 0)
printf("scalef, Ncell, chunk: %f %d %d \n",scalef,Ncell,chunk);
54
}
} //endif size > 1
//apply stencil
for(int cell = first; cell < last; cell++)
Tnew[cell] = Temp[cell] + scalef * (Temp[cell+1] - 2.*Temp[cell] + Temp[cel
if(size > 1)
{
// 0 is a ghost cell
if(rank != 0)
Tnew[1] = Temp[1] + scalef * (Temp[2] - 2.*Temp[1] + Temp[0]);
if (size > 1)
{
swap = calloc(Ncell, sizeof(double));
MPI_Allgather(&Tnew[1],last,MPI_DOUBLE,swap,last,MPI_DOUBLE,MPI_COMM_WORLD);
}
MPI_Finalize();
if (rank == 0)
{
char outname[30] = "temp_field_np_";
char SZ[10];
sprintf(SZ,"%d",size);
strcat(outname, SZ);
In [77]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
55
mpicc --version
mpicc -Wall -O2 -march=native -mtune=native -lm -lmpi -o heat.exe heat.c
In [78]: %%bash
mpirun -n 2 ./heat.exe
In [79]: %%bash
mpirun -n 4 ./heat.exe
In [80]: %%bash
paste temp_field_np_2 temp_field_np_4 | awk 'NR>1{a+=($2-$4)*($2-$4)}END{print a/(NR-1)
In [81]: Image("pictures/gears.png")
Out[81]:
56
5.0.3 Solution 3.0
In [82]: %%writefile heat_nb.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
//g mancini may 18
int main()
{
double H0 = 100., H1 = 100.;
MPI_Init(0,NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Request Requests[4];
if (size > 1)
{
if(Ncell % size != 0)
Ncell = Ncell + Ncell % size;
chunk = Ncell / size + 2;
first = 2;
last = chunk-2;
Temp = calloc(chunk, sizeof(double));
Tnew = calloc(chunk, sizeof(double));
if(rank == 0)
Temp[1] = H0;
if(rank == size-1)
Temp[chunk-2] = H1;
}
else
{
first = 1;
last = Ncell - 1;
chunk = Ncell;
57
Temp[0] = H0;
Temp[Ncell-1] = H1;
}
if(rank == 0)
printf("scalef, Ncell, chunk: %f %d %d \n",scalef,Ncell,chunk);
//apply stencil
for(int cell = first; cell < last; cell++)
Tnew[cell] = Temp[cell] + scalef * (Temp[cell+1] - 2.*Temp[cell] + Temp[cel
//receive
if(size > 1)
{
if(rank != 0)
{
MPI_Wait(&(Requests[0]),MPI_STATUS_IGNORE);
MPI_Wait(&(Requests[2]),MPI_STATUS_IGNORE);
58
Tnew[1] = Temp[1] + scalef * (Temp[2] - 2.*Temp[1] + Temp[0]);
}
if(rank != size - 1)
{
MPI_Wait(&(Requests[1]),MPI_STATUS_IGNORE);
MPI_Wait(&(Requests[3]),MPI_STATUS_IGNORE);
Tnew[last] = Temp[last] + scalef * (Temp[last+1]
- 2.*Temp[last] + Temp[last-1]);
}
}
swap = Tnew;
Tnew = Temp;
Temp = swap;
} //end for time
if (size > 1)
{
swap = calloc(Ncell, sizeof(double));
MPI_Allgather(&Tnew[1],last,MPI_DOUBLE,swap,last,MPI_DOUBLE,MPI_COMM_WORLD);
}
MPI_Finalize();
if (rank == 0)
{
char outname[30] = "temp_field_np_";
char SZ[10];
sprintf(SZ,"%d",size);
strcat(outname, SZ);
FILE *fp = fopen(outname, "w");
for(int i = 0; i < Ncell; i++)
fprintf(fp,"%d %f\n", i, swap[i]);
fclose(fp);
}
return 0;
}
Writing heat_nb.c
In [83]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
mpicc --version
mpicc -Wall -O2 -march=native -mtune=native -lm -lmpi -o heat_nb.exe heat_nb.c
59
In [84]: %%bash
mpirun -n 4 ./heat_nb.exe
In [85]: %%bash
paste temp_field_np_2 temp_field_np_4 | awk 'NR>1{a+=($2-$4)}END{print a/(NR-1)}'
double t0 = MPI_Wtime();
if(num_steps % size != 0)
60
num_steps += num_steps/size;
int chunk = num_steps / size;
int i0 = rank*chunk;
int i1 = (rank+1)*chunk;
double I = 0.;
#pragma omp parallel for simd reduction(+:I)
for(int i = i0; i < i1; i++)
{
nt = omp_get_num_threads();
double mid_x = start + dx * ((double ) i + 0.5);
double mid_y = 4.0/(1. + mid_x*mid_x) ;
I += mid_y;
}
MPI_Reduce(&I, &PI, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
PI = PI * dx;
double t1 = MPI_Wtime();
MPI_Finalize();
if(rank==0)
{
printf("Loop completed in %f using %d procs and %d threads\n",1000.*(t1-t0),siz
printf("Integral value %f\n",PI);
}
}
Writing integral_mpi_omp.c
In [87]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
gcc -Wall -O2 -march=native -mtune=native -msse4.2 -fopenmp -fopenmp-simd -fopt-info-ve
-o I.exe integral_mpi_omp.c -lm -lmpi
In [88]: %%bash
source $HOME/Dropbox/Slackware/gcc_vars.sh
OMP_NUM_THREADS=4 mpirun -n 2 ./I.exe
Provided mode 1
Loop completed in 104.758742 using 2 procs and 4 threads
Integral value 3.141593
Caveats
61
Not all MPIs are threadsafe. MPI 2.0 defines threading modes:
- MPI_Thread_Single: no support for multiple threads.
- MPI_Thread_Funneled: Multiple threads, only master calls MPI.
- MPI_Thread_Serialized: Mult threads each calling MPI, but they do it one at a time.
- MPI_Thread_Multiple: Multiple threads without any restrictions.
Check out threading mode with:
MPI_init_thread(desired_mode, delivered_mode, ierr)
Environment variables are not propagated by mpirun.
Youll need to broadcast OpenMP parameters and set them with the library routines.
In [89]: Image("pictures/gears.png")
Out[89]:
1. MPI algorithms often require replicated data making them less memory efficient.
2. Fewer total MPI communicating agents means fewer messages and less overhead from
message conflicts.
3. Algorithms with good cache efficiency should benefit from shared caches of multi-threaded
programs.
The model maps perfectly with clusters of SMP nodes. But really, its a case by case basis and
to large extent depends on the particular application.
62