Parallel Programming Using MPI
Morten Hjorth-Jensen
Department of Physics and Center of Mathematics for Applications University of Oslo, N-0316 Oslo, Norway
NOTUR 2011
Target group
You have some experience in programming but have never tried to parallelize your codes. Here I will base my examples on C/C++ and Fortran using the Message Passing Interface (MPI). In order to get you started I will show some simple examples using numerical integration. I will also give you some simple hints on how to run and install codes with MPI on your laptop/PC. The programs can be found at http://www.uio.no/studier/emner/matnat/fys/FYS3150/index-eng.xml, under the programs link for, for example, fall 2010. See also Xing Cai's lectures, right after these. Good text: Karniadakis and Kirby, Parallel Scientific Computing in C++ and MPI, Cambridge.
Strategies
Develop codes locally, run with a few processes and test your codes. Do benchmarking, timing and so forth on local nodes, for example your laptop or PC. You can install MPICH2 on your laptop/PC; many UiO PCs running Linux have MPICH2 installed. Test by typing which mpd. When you are convinced that your codes run correctly, you start your production runs on available supercomputers, in our case titan.uio.no.
Here we declare that we will use 4 processes via the --ncpus option when starting the mpd daemon and via -n 4 when running. End the session with mpdallexit.
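A minimal sketch of such a local session, assuming MPICH2's mpd process manager and an already compiled executable program.x (the name is just an example):

# start the mpd daemon and declare 4 cpus
$ mpd --ncpus=4 &
# run the code with 4 processes
$ mpiexec -n 4 ./program.x
# shut down the daemon when done
$ mpdallexit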
Of course: go to http://www.mcs.anl.gov/research/projects/mpich2/, follow the instructions and install it on your own PC/laptop. If you use Ubuntu it is very simple: sudo apt-get install mpich2. For Windows, you may think of installing WUBI, see http://www.ubuntu.com/download/ubuntu/windows-installer
Titan
Hardware
304 dual-cpu quad-core Sun X2200 Opteron nodes (2432 cores in total), 2.2 GHz, with 8-16 GB RAM and 250-1000 GB disk on each node
3 eight-cpu quad-core Sun X4600 AMD Opteron nodes (96 cores in total), 2.5 GHz, with 128, 128 and 256 GB memory, respectively
Infiniband interconnect
A heterogeneous cluster!
Titan
Software
Batch system: SLURM and MAUI
Message Passing Interface (MPI): OpenMPI, ScaMPI, MPICH2
Compilers: GCC, Intel, Portland and Pathscale
Optimized math libraries and scientific applications
All you need may be found under /site
Available software: http://www.hpc.uio.no/index.php/Titan_software
Getting started
Batch systems
A batch system controls the use of the cluster resources:
Submits the job to the right resource
Monitors the job while executing
Restarts the job in case of failure
Takes care of priorities and queues to control the execution order of unrelated jobs
Getting started
Modules
Different compilers, MPI versions and applications need different sets of user environment variables. The modules package lets you load and remove the different variable sets. Useful commands:
List available modules: module avail
Load module: module load <environment>
Unload module: module unload <environment>
Currently loaded: module list
http://hpc.uio.no/index.php/Titan_User_Guide
Example
Interactively
# login to titan
$ ssh titan.uio.no
# ask for 4 cpus
$ qlogin --account=fys3150 --ntasks=4
# start a job setup, note the punktum!
$ source /site/bin/jobsetup
# we want to use the intel module
$ module load intel
$ module load openmpi/1.2.8.intel
$ mkdir -p fys3150/mpiexample/
$ cd fys3150/mpiexample/
# Use program6.cpp from the course pages, see chapter 4
# compile the program
$ mpic++ -O3 -o program6.x program6.cpp
# and execute it
$ mpirun ./program6.x
Trapezoidal rule = 3.14159 Time = 0.000378132 on number of processors: 4
Example
Submitting
# login to titan
$ ssh titan.uio.no
# and submit it
$ sbatch job.slurm
$ exit
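The job script job.slurm itself is not shown on these slides; a minimal sketch of what such a script could look like (job name, wall-time limit and module versions are assumptions) is:

#!/bin/bash
# accounting project and resource request (values are assumptions)
#SBATCH --job-name=mpiexample
#SBATCH --account=fys3150
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# set up the environment as in the interactive example
source /site/bin/jobsetup
module load intel
module load openmpi/1.2.8.intel

# run the program
mpirun ./program6.x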
Example
Checking execution
# check if job is running:
$ showq -u mhjensen

ACTIVE JOBS
JOBNAME     USERNAME    STATE     PROC    REMAINING    STARTTIME
883129      mhjensen    Running      4     10:31:17    Fri Oct  2 13:59:25

1 Active Job

IDLE JOBS
JOBNAME     USERNAME    STATE     PROC    WCLIMIT      QUEUETIME

BLOCKED JOBS
JOBNAME     USERNAME    STATE     PROC    WCLIMIT      QUEUETIME

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
Admonitions
Remember to exit from qlogin sessions; the resource is reserved for you until you exit. Don't run jobs on the login nodes; these are only for compiling and editing files.
MPI is a library, not a language. It specifies the names, calling sequences and results of functions or subroutines to be called from C/C++ or Fortran programs, and the classes and methods that make up the MPI C++ library. The programs that users write in Fortran, C or C++ are compiled with ordinary compilers and linked with the MPI library. MPI programs should be able to run on all possible machines and with all MPI implementations without change. An MPI computation is a collection of processes communicating with messages.
MPI routines have a C/C++ binding of the form MPI_Command_name and a Fortran binding where routine names are in uppercase, MPI_COMMAND_NAME (but can also be in lower case).
MPI
MPI is a library specification for the message passing interface, proposed as a standard. It is independent of hardware; it is not a language or compiler specification; and it is not a specific implementation or product. It is a message-passing standard for portability and ease of use, designed for high performance. Insert communication and synchronization functions where necessary.
The pursuit of shorter computation time and larger simulation size gives rise to parallel computing, where multiple processors are involved in solving a global problem. The essence is to divide the entire computation evenly among the collaborating processors: divide and conquer.
Conventional single-processor computers can be called SISD (single-instruction-single-data) machines. SIMD (single-instruction-multiple-data) machines incorporate the idea of parallel processing, using a large number of processing units to execute the same instruction on different data. Modern parallel computers are so-called MIMD (multiple-instruction-multiple-data) machines and can execute different instruction streams in parallel on different data.
One way of categorizing modern parallel computers is to look at the memory configuration. In shared memory systems the CPUs share the same address space; any CPU can access any data in the global memory. In distributed memory systems each CPU has its own memory; the CPUs are connected by some network and may exchange messages.
In message passing, all involved processors have an independent memory address space. The user is responsible for partitioning the data/work of a global problem and distributing the subproblems to the processors. Collaboration between processors is achieved by explicit message passing, which is used for data transfer plus synchronization. This paradigm is the most general one, and the user has full control. Better parallel efficiency is usually achieved by explicit message passing; however, message-passing programming is more difficult.
SPMD
Although message passing supports MIMD, it suffices to use an SPMD (single-program-multiple-data) model, which is flexible enough for practical cases: the same executable runs on all the processors; each processor works primarily with its assigned local data; the progression of the code is allowed to differ between synchronization points; it is possible to have a master/slave model. This is the standard option in Monte Carlo calculations and numerical integration.
Distributed memory is the dominant hardware configuration. There is a large diversity in these machines, from MPP (massively parallel processing) systems to clusters of off-the-shelf PCs, which are very cost-effective. Message-passing is a mature programming paradigm and widely accepted. It often provides an efficient match to the hardware. It is primarily used for distributed memory systems, but can also be used on shared memory systems. In these lectures we consider only message-passing for writing parallel programs.
Uneven load balance: not all the processors can perform useful work at all times. Overhead of synchronization. Overhead of communication. Extra computation due to parallelization. Because of the above overhead, and because certain parts of a sequential algorithm cannot be parallelized, we may not achieve an optimal parallelization.
Identify the part(s) of a sequential algorithm that can be executed in parallel. This is the difficult part. Distribute the global work and data among P processors.
MPI is a message-passing library where all the routines have a corresponding C/C++ binding
MPI_Command_name
and a Fortran binding (routine names are in uppercase, but can also be in lower case)
MPI_COMMAND_NAME
Communicator
A group of MPI processes with a name (context). Any process is identified by its rank; the rank is only meaningful within a particular communicator. By default the communicator MPI_COMM_WORLD contains all the MPI processes. Communicators provide a mechanism to identify subsets of processes and promote modular design of parallel libraries.
PROGRAM hello
INCLUDE "mpif.h"
INTEGER :: size, my_rank, ierr
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
WRITE(*,*) "Hello world, I've rank ", my_rank, " out of ", size
CALL MPI_FINALIZE(ierr)
END PROGRAM hello
Note 1
The output to screen is not ordered, since all processes are trying to write to screen simultaneously. It is then the operating system which opts for an ordering. If we wish to have an organized output, starting from the first process, we may rewrite our program as in the next example.
#include <mpi.h>
#include <iostream>
using namespace std;

int main (int nargs, char* args[])
{
  int numprocs, my_rank, i;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  for (i = 0; i < numprocs; i++) {
    MPI_Barrier (MPI_COMM_WORLD);
    if (i == my_rank) {
      cout << "Hello world, I have rank " << my_rank << " out of " << numprocs << endl;
    }
  }
  MPI_Finalize ();
  return 0;
}
Note 2
Here we have used the MPI_Barrier function to ensure that every process has completed its set of instructions in a particular order. A barrier is a special collective operation that does not allow the processes to continue until all processes in the communicator (here MPI_COMM_WORLD) have called MPI_Barrier. The barrier makes sure that all processes have reached the same point in the code. Many of the collective operations, like MPI_ALLREDUCE to be discussed later, have the same property; namely, no process can exit the operation until all processes have started. However, this is slightly more time-consuming since the processes synchronize between themselves as many times as there are processes. In the next Hello World example we use the send and receive functions in order to have a synchronized action.
Note 3
The basic sending of messages is given by the function MPI_SEND, which in C/C++ is defined as

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

This single command allows the passing of any kind of variable, even a large array, to any group of tasks. The variable buf is the variable we wish to send, while count is the number of elements we are passing. If we are passing only a single value, this should be 1. If we transfer an array, it is the overall size of the array. For example, if we want to send a 10 by 10 array, count would be 10 × 10 = 100 since we are actually passing 100 values.
Note 4
Once you have sent a message, you must receive it on another task. The function MPI_RECV is similar to the send call:

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

The arguments that differ from those in MPI_SEND are buf, which is the name of the variable where you will be storing the received data, and source, which replaces the destination in the send command; this is the rank of the sender. Finally, we have used MPI_Status status, where one can check whether the receive was completed. The output of this code is the same as in the previous example, but now process 0 sends a message to process 1, which forwards it further to process 2, and so forth.
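The corresponding code slide is not reproduced here; a minimal sketch of such a send-and-receive Hello World, consistent with the description above (the tag value 100 and the token variable are arbitrary choices), could look like this:

#include <mpi.h>
#include <iostream>
using namespace std;

int main (int nargs, char* args[])
{
  int numprocs, my_rank, token = 0;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank > 0) {
    // wait for the message from the previous process before printing
    MPI_Recv (&token, 1, MPI_INT, my_rank-1, 100, MPI_COMM_WORLD, &status);
  }
  cout << "Hello world, I have rank " << my_rank << " out of " << numprocs << endl;
  if (my_rank < numprocs-1) {
    // pass the token on to the next process
    MPI_Send (&token, 1, MPI_INT, my_rank+1, 100, MPI_COMM_WORLD);
  }
  MPI_Finalize ();
  return 0;
}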
Integrating
Examples
The code example computes π using the trapezoidal rule,
\[
I = \int_a^b f(x)\, dx \approx h\left(\frac{f(a)}{2} + f(a+h) + f(a+2h) + \cdots + f(b-h) + \frac{f(b)}{2}\right),
\]
with h = (b - a)/n the step length.
double int_function(double);
double trapezoidal_rule(double, double, int, double (*)(double));
//  Main function begins here
int main (int nargs, char* args[])
{
  int n, local_n, numprocs, my_rank;
  double a, b, h, local_a, local_b, total_sum, local_sum;
  double time_start, time_end, total_time;
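The rest of main is on slides not reproduced here; a minimal sketch of how it could continue, consistent with the declarations above and the printed output shown earlier (the integration limits a = 0, b = 1 and n = 1000 steps are assumptions), is:

  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  time_start = MPI_Wtime();
  a = 0.0; b = 1.0; n = 1000;          // assumed integration limits and number of steps
  h = (b - a)/n;
  local_n = n/numprocs;                // steps handled by this process
  local_a = a + my_rank*local_n*h;     // local integration limits
  local_b = local_a + local_n*h;
  total_sum = 0.0;
  local_sum = trapezoidal_rule(local_a, local_b, local_n, &int_function);
  // collect the partial sums on the master process (rank 0)
  MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  time_end = MPI_Wtime();
  total_time = time_end - time_start;
  if (my_rank == 0) {
    cout << "Trapezoidal rule = " << total_sum << endl;
    cout << "Time = " << total_time << " on number of processors: " << numprocs << endl;
  }
  MPI_Finalize ();
  return 0;
}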
MPI_Reduce

Here we have used

int MPI_Reduce(void *senddata, void *resultdata, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

The two variables senddata and resultdata are obvious, besides the fact that one sends the address of the variable or of the first element of an array. If they are arrays, they need to have the same size. The variable count represents the total number of elements, 1 in the case of just one variable, while MPI_Datatype defines the type of variable which is sent and received. The new feature is MPI_Op. It defines the type of operation we want to do. In our case, since we are summing the contributions from every process, we define MPI_Op = MPI_SUM. If we have an array or matrix we can search for the largest or smallest element by sending either MPI_MAX or MPI_MIN. If we want the location as well (which array element), we simply transfer MPI_MAXLOC or MPI_MINLOC. If we want the product we write MPI_PROD. MPI_Allreduce is defined as

int MPI_Allreduce(void *senddata, void *resultdata, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
We use MPI_Reduce to collect data from each process. Note also the use of the function MPI_Wtime. The final functions are the trapezoidal rule itself and the integrand.
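These are not reproduced on the slides included here; a minimal sketch, consistent with the declarations and the MPI_Reduce call above and assuming the integrand 4/(1 + x*x) on [0, 1] (which integrates to π and matches the printed result), is:

//  this function defines the integrand, here assumed to be 4/(1+x*x)
double int_function(double x)
{
  double value = 4.0/(1.0 + x*x);
  return value;
}

//  the trapezoidal rule on [a,b] with n steps for a given function func
double trapezoidal_rule(double a, double b, int n, double (*func)(double))
{
  double trapez_sum;
  double fa, fb, x, step;
  int j;
  step = (b - a)/((double) n);
  fa = (*func)(a)/2.0;
  fb = (*func)(b)/2.0;
  trapez_sum = 0.0;
  for (j = 1; j <= n-1; j++) {
    x = j*step + a;
    trapez_sum += (*func)(x);
  }
  trapez_sum = (trapez_sum + fb + fa)*step;
  return trapez_sum;
}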
Till now we have not paid much attention to speed and possible optimization possibilities inherent in the various compilers. We have compiled and linked as

mpic++ -c mycode.cpp
mpic++ -o mycode.exe mycode.o

For Fortran, replace mpic++ with mpif90. This is what we call a flat compiler option and should be used when we develop the code. It normally produces a very large and slow code when translated to machine instructions. We use this option for debugging and for establishing the correct program output, because every operation is done precisely as the user specified it. It is instructive to look up the compiler manual for further instructions:

man mpic++ > out_to_file
We have additional compiler options for optimization. These may include procedure inlining, moving constants inside loops outside the loops, identifying potential parallelism, automatic vectorization, or replacing a division with a reciprocal and a multiplication if this speeds up the code.

mpic++ -O3 -c mycode.cpp
mpic++ -O3 -o mycode.exe mycode.o

This is the recommended option. But you must check that you get the same results as previously.
It is also useful to profile your program during the development stage. You would then compile with

mpic++ -pg -O3 -c mycode.cpp
mpic++ -pg -O3 -o mycode.exe mycode.o

After you have run the code you can obtain the profiling information via

gprof mycode.exe > out_to_profile

When you have profiled your code properly, you must take out this option, as it increases your CPU expenditure. For memory tests use valgrind, see valgrind.org.
This is a rather simple and appealing method after von Neumann. Assume that we are looking at an interval x ∈ [a, b], this being the domain of the probability distribution function (PDF) p(x). Suppose also that the largest value our distribution function takes in this interval is M, that is p(x) ≤ M for x ∈ [a, b].
Then we generate a random number x from the uniform distribution on [a, b] and a corresponding number s from the uniform distribution on [0, M]. If p(x) ≥ s, we accept the new value of x; otherwise we generate two new random numbers x and s and perform the test again.
Acceptance-Rejection Method

As an example, consider the integral
\[
I = \int_0^3 \exp(x)\, dx .
\]
Obviously it is much easier to evaluate it analytically; however, the integrand could pose more difficult challenges. The aim here is simply to show how to implement the acceptance-rejection algorithm using MPI. The integral is the area below the curve f(x) = exp(x). If we uniformly fill the rectangle spanned by x ∈ [0, 3] and y ∈ [0, exp(3)], the fraction of points below the curve obtained from a uniform distribution, multiplied by the area of the rectangle, should approximate the chosen integral. It is rather easy to implement this numerically, as shown in the following code.
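That code slide is not included here; a minimal sketch of the parallel acceptance-rejection (hit-and-miss) estimate, using the standard rand() generator and an assumed number of samples per process, could look like:

#include <mpi.h>
#include <cmath>
#include <cstdlib>
#include <iostream>
using namespace std;

int main (int nargs, char* args[])
{
  int numprocs, my_rank;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  int n = 1000000;                       // samples per process (assumed)
  double xmax = 3.0, ymax = exp(3.0);
  srand (12345 + my_rank);               // different seed on each process
  int local_accepted = 0, total_accepted = 0;
  for (int i = 0; i < n; i++) {
    double x = xmax*rand()/((double) RAND_MAX);
    double s = ymax*rand()/((double) RAND_MAX);
    if (s <= exp(x)) local_accepted++;   // point lies below the curve
  }
  // sum the accepted counts from all processes on rank 0
  MPI_Reduce (&local_accepted, &total_accepted, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (my_rank == 0) {
    double integral = xmax*ymax*total_accepted/((double) n*numprocs);
    cout << "Integral = " << integral << " exact: " << exp(3.0)-1.0 << endl;
  }
  MPI_Finalize ();
  return 0;
}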
Acceptance-Rejection Method
Here it can be useful to split the program into subtasks:
A specific function which performs the Monte Carlo sampling
A function which collects all data, performs the statistical analysis and perhaps writes in parallel to file
void integrate(int number_cycles, double &Integral)
{
  double total_number_cycles;
  double variance, energy, error;
  double total_cumulative, total_cumulative_2, cumulative, cumulative_2;
  total_number_cycles = number_cycles*numprocs;
  //  Do the mc sampling
  cumulative = cumulative_2 = 0.0;
  total_cumulative = total_cumulative_2 = 0.0;
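The remaining slides with the body of this function are not included; a minimal sketch of how it could continue, assuming a Monte Carlo sampling helper mc_sampling() and that numprocs and my_rank are available as globals (all assumptions), is:

  // every process performs its own Monte Carlo sampling (hypothetical helper)
  mc_sampling(number_cycles, cumulative, cumulative_2);
  // add up the local sums of f and f*f from all processes
  MPI_Allreduce(&cumulative, &total_cumulative, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(&cumulative_2, &total_cumulative_2, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  // statistical analysis of the collected data
  energy = total_cumulative/total_number_cycles;
  variance = total_cumulative_2/total_number_cycles - energy*energy;
  error = sqrt(variance/total_number_cycles);
  if (my_rank == 0) {
    cout << "Integral = " << energy << " error = " << error << endl;
  }
  Integral = energy;
}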
Parallel Jacobi Algorithm

Different data distribution schemes:
Row-wise distribution
Column-wise distribution
Other alternatives not discussed here: cyclic shifting
Direct solvers such as Gaussian elimination and LU decomposition
Iterative solvers such as the basic iterative solvers: Jacobi, Gauss-Seidel, successive over-relaxation
Other iterative methods such as Krylov subspace methods, with the generalized minimum residual (GMRES), conjugate gradient, etc.
It is a simple method for solving
\[
A x = b,
\]
where A is a matrix and x and b are vectors. The vector x is the unknown. It is an iterative scheme where, after k + 1 iterations, we have
\[
x^{(k+1)} = D^{-1}\bigl(b - (L + U) x^{(k)}\bigr),
\]
with A = D + U + L, where D is a diagonal matrix, U an upper triangular matrix and L a lower triangular matrix.
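Written out for component i, this is the familiar update (a standard rearrangement, added here for reference):
\[
x_i^{(k+1)} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{j\neq i} a_{ij}\, x_j^{(k)}\Bigr), \qquad i = 1,\dots,n .
\]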
Row-wise distribution
Assume that the dimension of the matrix, n × n, can be divided by the number of CPUs P, with m = n/P.
Blocks of m rows of the coefficient matrix are distributed to the different CPUs.
The vector of unknowns and the right-hand side are distributed similarly.
Data to be communicated
For its m rows, each CPU already has all the columns of the matrix A.
Only part of the vector x is available on a CPU.
We therefore cannot carry out the matrix-vector multiplication directly; we need to communicate the vector x during the computations.
The vector x can be gathered on every CPU with MPI_Allgather:

int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

Here sendbuf holds the local piece of x (of length m) and recvbuf receives the whole gathered vector.
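A minimal sketch of the row-wise distributed matrix-vector product with MPI_Allgather, assuming row-major storage, m = n/P local rows, and blocks of x ordered by rank (all assumptions), is:

#include <mpi.h>

// local_A: the m x n block of rows owned by this CPU (row-major)
// local_x: the m entries of x owned by this CPU
// x:       work array of length n for the gathered full vector
// local_y: the m entries of y = A x owned by this CPU
void matvec_rowwise(double *local_A, double *local_x, double *x,
                    double *local_y, int m, int n)
{
  // gather the full vector x on every CPU (blocks ordered by rank)
  MPI_Allgather(local_x, m, MPI_DOUBLE, x, m, MPI_DOUBLE, MPI_COMM_WORLD);
  // each CPU multiplies its own rows by the full vector
  for (int i = 0; i < m; i++) {
    local_y[i] = 0.0;
    for (int j = 0; j < n; j++) {
      local_y[i] += local_A[i*n + j]*x[j];
    }
  }
}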
Another method: cyclic shift.
Shift the partial vector x upward at each step.
Do a partial matrix-vector multiplication on each CPU at each step.
After P steps (P is the number of CPUs), the overall matrix-vector multiplication is complete.
Each CPU needs only to communicate with its neighboring CPUs, which provides opportunities to overlap communication with computations; see the sketch below.
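A minimal sketch of the cyclic-shift variant, using MPI_Sendrecv_replace for the shift and the same assumed data layout as above, is:

#include <mpi.h>

// local_A: the m x n block of rows owned by this CPU (row-major), n = m*P
// work_x:  array of length m, initially holding this CPU's own block of x
// local_y: the m entries of y = A x owned by this CPU
void matvec_cyclic(double *local_A, double *work_x, double *local_y,
                   int m, int n, int P, int my_rank)
{
  for (int i = 0; i < m; i++) local_y[i] = 0.0;
  for (int step = 0; step < P; step++) {
    // after 'step' upward shifts this CPU holds block (my_rank + step) mod P of x
    int block = (my_rank + step) % P;
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < m; j++) {
        local_y[i] += local_A[i*n + block*m + j]*work_x[j];
      }
    }
    // shift the block upward: send to my_rank-1, receive from my_rank+1
    MPI_Sendrecv_replace(work_x, m, MPI_DOUBLE,
                         (my_rank + P - 1) % P, 0,
                         (my_rank + 1) % P, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
}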
Row-wise algorithm
Column-wise distribution
Blocks of m columns of the matrix A are distributed among the P different CPUs.
Blocks of m rows of the vectors x and b are distributed to the different CPUs.
Data to be communicated
Each CPU already has the coefficient matrix data for m columns and a block of m rows of the vector x. A partial A x can therefore be computed on each CPU independently. Communication is then needed to obtain the whole A x, using MPI_Allreduce; see the sketch below.
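A minimal sketch of this column-wise variant, again with assumed array layouts (local_A now holds m complete columns), is:

#include <mpi.h>

// local_A:   the n x m block of columns owned by this CPU (row-major: n rows, m columns)
// local_x:   the m entries of x that multiply these columns
// partial_y: work array of length n for the partial product
// y:         the full result y = A x, available on every CPU afterwards
void matvec_columnwise(double *local_A, double *local_x, double *partial_y,
                       double *y, int m, int n)
{
  // partial product using only the locally stored columns
  for (int i = 0; i < n; i++) {
    partial_y[i] = 0.0;
    for (int j = 0; j < m; j++) {
      partial_y[i] += local_A[i*m + j]*local_x[j];
    }
  }
  // sum the partial products from all CPUs to obtain the full A x
  MPI_Allreduce(partial_y, y, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}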