Parallel Programming Using MPI
Morten Hjorth-Jensen
Department of Physics and Center of Mathematics for Applications University of Oslo, N-0316 Oslo, Norway
NOTUR 2011
Target group
You have some experience in programming but have never tried to parallelize your codes. Here I will base my examples on C/C++ and Fortran using the Message Passing Interface (MPI). In order to get you started I will show some simple examples using numerical integration. I will also give you some simple hints on how to run and install codes with MPI on your laptop/PC. The programs can be found at http://www.uio.no/studier/emner/matnat/fys/FYS3150/index-eng.xml, under the programs link for, for example, fall 2010. See also Xing Cai's lectures, right after these. Good text: Karniadakis and Kirby, Parallel Scientific Computing in C++ and MPI, Cambridge.
Strategies
Develop codes locally, run with a few processes and test your codes. Do benchmarking, timing and so forth on local nodes, for example your laptop or PC. You can install MPICH2 on your laptop/PC; many UiO PCs running Linux have MPICH2 installed. Test by typing which mpd. When you are convinced that your codes run correctly, you start your production runs on available supercomputers, in our case titan.uio.no.
Here we declare that we will use 4 processes via the --ncpus option when starting the mpd daemon and via -n 4 when running. End the session with mpdallexit.
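A minimal sketch of such a local session, assuming MPICH2's mpd process manager and an already compiled executable program.x (the name is just an example):

# start the mpd daemon and declare 4 cpus
$ mpd --ncpus=4 &
# run the code with 4 processes
$ mpiexec -n 4 ./program.x
# shut down the daemon when done
$ mpdallexit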
Of course: go to http://www.mcs.anl.gov/research/projects/mpich2/, follow the instructions and install it on your own PC/laptop. If you use Ubuntu it is very simple: sudo apt-get install mpich2. For Windows, you may think of installing WUBI, see http://www.ubuntu.com/download/ubuntu/windows-installer
Titan
Hardware
304 dual-cpu quad-core Sun X2200 Opteron nodes (2432 cores in total), 2.2 GHz, with 8-16 GB RAM and 250-1000 GB disk on each node
3 eight-cpu quad-core Sun X4600 AMD Opteron nodes (96 cores in total), 2.5 GHz, with 128, 128 and 256 GB memory, respectively
Infiniband interconnect
A heterogeneous cluster!
Titan
Software
Batch system: SLURM and MAUI
Message Passing Interface (MPI): OpenMPI, ScaMPI, MPICH2
Compilers: GCC, Intel, Portland and Pathscale
Optimized math libraries and scientific applications
All you need may be found under /site
Available software: http://www.hpc.uio.no/index.php/Titan_software
Getting started
Batch systems
A batch system controls the use of the cluster resources:
Submits the job to the right resource
Monitors the job while executing
Restarts the job in case of failure
Takes care of priorities and queues to control the execution order of unrelated jobs
Getting started
Modules
Different compilers, MPI versions and applications need different sets of user environment variables. The modules package lets you load and remove the different variable sets. Useful commands:
List available modules: module avail
Load module: module load <environment>
Unload module: module unload <environment>
Currently loaded: module list
http://hpc.uio.no/index.php/Titan_User_Guide
Example
Interactively
# login to titan
$ ssh titan.uio.no
# ask for 4 cpus
$ qlogin --account=fys3150 --ntasks=4
# start a job setup, note the punktum!
$ source /site/bin/jobsetup
# we want to use the intel module
$ module load intel
$ module load openmpi/1.2.8.intel
$ mkdir -p fys3150/mpiexample/
$ cd fys3150/mpiexample/
# Use program6.cpp from the course pages, see chapter 4
# compile the program
$ mpic++ -O3 -o program6.x program6.cpp
# and execute it
$ mpirun ./program6.x
Trapezoidal rule = 3.14159 Time = 0.000378132 on number of processors: 4
Example
Submitting
# login to titan
$ ssh titan.uio.no
# and submit it
$ sbatch job.slurm
$ exit
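The job script job.slurm itself is not shown on these slides; a minimal sketch of what such a script could look like (job name, wall-time limit and module versions are assumptions) is:

#!/bin/bash
# accounting project and resource request (values are assumptions)
#SBATCH --job-name=mpiexample
#SBATCH --account=fys3150
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# set up the environment as in the interactive example
source /site/bin/jobsetup
module load intel
module load openmpi/1.2.8.intel

# run the program
mpirun ./program6.x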
Example
Checking execution
# check if job is running:
$ showq -u mhjensen

ACTIVE JOBS
JOBNAME     USERNAME    STATE     PROC    REMAINING    STARTTIME
883129      mhjensen    Running      4     10:31:17    Fri Oct  2 13:59:25

1 Active Job

IDLE JOBS
JOBNAME     USERNAME    STATE     PROC    WCLIMIT      QUEUETIME

BLOCKED JOBS
JOBNAME     USERNAME    STATE     PROC    WCLIMIT      QUEUETIME

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
Admonitions
Remember to exit from qlogin sessions; the resource is reserved for you until you exit. Don't run jobs on the login nodes; these are only for compiling and editing files.
MPI is a library, not a language. It specifies the names, calling sequences and results of functions or subroutines to be called from C/C++ or Fortran programs, and the classes and methods that make up the MPI C++ library. The programs that users write in Fortran, C or C++ are compiled with ordinary compilers and linked with the MPI library. MPI programs should be able to run on all possible machines and with all MPI implementations without change. An MPI computation is a collection of processes communicating with messages.
MPI routines have a C/C++ binding of the form MPI_Command_name and a Fortran binding where routine names are in uppercase, MPI_COMMAND_NAME (but can also be in lower case).
MPI
MPI is a library specification for the message passing interface, proposed as a standard. It is independent of hardware; it is not a language or compiler specification; and it is not a specific implementation or product. It is a message-passing standard for portability and ease of use, designed for high performance. Insert communication and synchronization functions where necessary.
The pursuit of shorter computation time and larger simulation size gives rise to parallel computing, where multiple processors are involved in solving a global problem. The essence is to divide the entire computation evenly among the collaborating processors: divide and conquer.
Conventional single-processor computers can be called SISD (single-instruction-single-data) machines. SIMD (single-instruction-multiple-data) machines incorporate the idea of parallel processing, using a large number of processing units to execute the same instruction on different data. Modern parallel computers are so-called MIMD (multiple-instruction-multiple-data) machines and can execute different instruction streams in parallel on different data.
One way of categorizing modern parallel computers is to look at the memory configuration. In shared memory systems the CPUs share the same address space; any CPU can access any data in the global memory. In distributed memory systems each CPU has its own memory; the CPUs are connected by some network and may exchange messages.
In message passing, all involved processors have an independent memory address space. The user is responsible for partitioning the data/work of a global problem and distributing the subproblems to the processors. Collaboration between processors is achieved by explicit message passing, which is used for data transfer plus synchronization. This paradigm is the most general one, and the user has full control. Better parallel efficiency is usually achieved by explicit message passing; however, message-passing programming is more difficult.
SPMD
Although message passing supports MIMD, it suffices to use an SPMD (single-program-multiple-data) model, which is flexible enough for practical cases: the same executable runs on all the processors; each processor works primarily with its assigned local data; the progression of the code is allowed to differ between synchronization points; it is possible to have a master/slave model. This is the standard option in Monte Carlo calculations and numerical integration.
Distributed memory is the dominant hardware configuration. There is a large diversity in these machines, from MPP (massively parallel processing) systems to clusters of off-the-shelf PCs, which are very cost-effective. Message-passing is a mature programming paradigm and widely accepted. It often provides an efficient match to the hardware. It is primarily used for distributed memory systems, but can also be used on shared memory systems. In these lectures we consider only message-passing for writing parallel programs.
Uneven load balance: not all the processors can perform useful work at all times. Overhead of synchronization. Overhead of communication. Extra computation due to parallelization. Because of the above overhead, and because certain parts of a sequential algorithm cannot be parallelized, we may not achieve an optimal parallelization.
Identify the part(s) of a sequential algorithm that can be executed in parallel. This is the difficult part. Distribute the global work and data among P processors.
MPI is a message-passing library where all the routines have a corresponding C/C++ binding
MPI_Command_name
and a Fortran binding (routine names are in uppercase, but can also be in lower case)
MPI_COMMAND_NAME
Communicator
A group of MPI processes with a name (context). Any process is identified by its rank; the rank is only meaningful within a particular communicator. By default the communicator MPI_COMM_WORLD contains all the MPI processes. Communicators provide a mechanism to identify subsets of processes and promote modular design of parallel libraries.
PROGRAM hello
INCLUDE "mpif.h"
INTEGER :: size, my_rank, ierr
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
WRITE(*,*) "Hello world, I've rank ", my_rank, " out of ", size
CALL MPI_FINALIZE(ierr)
END PROGRAM hello
Note 1
The output to screen is not ordered, since all processes are trying to write to screen simultaneously. It is then the operating system which opts for an ordering. If we wish to have an organized output, starting from the first process, we may rewrite our program as in the next example.
#include <mpi.h>
#include <iostream>
using namespace std;

int main (int nargs, char* args[])
{
  int numprocs, my_rank, i;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  for (i = 0; i < numprocs; i++) {
    MPI_Barrier (MPI_COMM_WORLD);
    if (i == my_rank) {
      cout << "Hello world, I have rank " << my_rank << " out of " << numprocs << endl;
    }
  }
  MPI_Finalize ();
  return 0;
}
Note 2
Here we have used the MPI_Barrier function to ensure that every process has completed its set of instructions in a particular order. A barrier is a special collective operation that does not allow the processes to continue until all processes in the communicator (here MPI_COMM_WORLD) have called MPI_Barrier. The barrier makes sure that all processes have reached the same point in the code. Many of the collective operations, like MPI_ALLREDUCE to be discussed later, have the same property; namely, no process can exit the operation until all processes have started. However, this is slightly more time-consuming since the processes synchronize between themselves as many times as there are processes. In the next Hello World example we use the send and receive functions in order to have a synchronized action.
Note 3
The basic sending of messages is given by the function MPI_SEND, which in C/C++ is defined as

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

This single command allows the passing of any kind of variable, even a large array, to any group of tasks. The variable buf is the variable we wish to send, while count is the number of elements we are passing. If we are passing only a single value, this should be 1. If we transfer an array, it is the overall size of the array. For example, if we want to send a 10 by 10 array, count would be 10 × 10 = 100 since we are actually passing 100 values.
Note 4
Once you have sent a message, you must receive it on another task. The function MPI_RECV is similar to the send call:

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

The arguments that differ from those in MPI_SEND are buf, which is the name of the variable where you will be storing the received data, and source, which replaces the destination in the send command; this is the rank of the sender. Finally, we have used MPI_Status status, where one can check whether the receive was completed. The output of this code is the same as in the previous example, but now process 0 sends a message to process 1, which forwards it further to process 2, and so forth.
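The corresponding code slide is not reproduced here; a minimal sketch of such a send-and-receive Hello World, consistent with the description above (the tag value 100 and the token variable are arbitrary choices), could look like this:

#include <mpi.h>
#include <iostream>
using namespace std;

int main (int nargs, char* args[])
{
  int numprocs, my_rank, token = 0;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank > 0) {
    // wait for the message from the previous process before printing
    MPI_Recv (&token, 1, MPI_INT, my_rank-1, 100, MPI_COMM_WORLD, &status);
  }
  cout << "Hello world, I have rank " << my_rank << " out of " << numprocs << endl;
  if (my_rank < numprocs-1) {
    // pass the token on to the next process
    MPI_Send (&token, 1, MPI_INT, my_rank+1, 100, MPI_COMM_WORLD);
  }
  MPI_Finalize ();
  return 0;
}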
Integrating
Examples
The code example computes π using the trapezoidal rule,
\[
I = \int_a^b f(x)\, dx \approx h\left(\frac{f(a)}{2} + f(a+h) + f(a+2h) + \cdots + f(b-h) + \frac{f(b)}{2}\right),
\]
with h = (b - a)/n the step length.
double int_function(double);
double trapezoidal_rule(double, double, int, double (*)(double));
//  Main function begins here
int main (int nargs, char* args[])
{
  int n, local_n, numprocs, my_rank;
  double a, b, h, local_a, local_b, total_sum, local_sum;
  double time_start, time_end, total_time;
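The rest of main is on slides not reproduced here; a minimal sketch of how it could continue, consistent with the declarations above and the printed output shown earlier (the integration limits a = 0, b = 1 and n = 1000 steps are assumptions), is:

  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  time_start = MPI_Wtime();
  a = 0.0; b = 1.0; n = 1000;          // assumed integration limits and number of steps
  h = (b - a)/n;
  local_n = n/numprocs;                // steps handled by this process
  local_a = a + my_rank*local_n*h;     // local integration limits
  local_b = local_a + local_n*h;
  total_sum = 0.0;
  local_sum = trapezoidal_rule(local_a, local_b, local_n, &int_function);
  // collect the partial sums on the master process (rank 0)
  MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  time_end = MPI_Wtime();
  total_time = time_end - time_start;
  if (my_rank == 0) {
    cout << "Trapezoidal rule = " << total_sum << endl;
    cout << "Time = " << total_time << " on number of processors: " << numprocs << endl;
  }
  MPI_Finalize ();
  return 0;
}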
MPI_Reduce

Here we have used

int MPI_Reduce(void *senddata, void *resultdata, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

The two variables senddata and resultdata are obvious, besides the fact that one sends the address of the variable or of the first element of an array. If they are arrays, they need to have the same size. The variable count represents the total number of elements, 1 in the case of just one variable, while MPI_Datatype defines the type of variable which is sent and received. The new feature is MPI_Op. It defines the type of operation we want to do. In our case, since we are summing the contributions from every process, we define MPI_Op = MPI_SUM. If we have an array or matrix we can search for the largest or smallest element by sending either MPI_MAX or MPI_MIN. If we want the location as well (which array element), we simply transfer MPI_MAXLOC or MPI_MINLOC. If we want the product we write MPI_PROD. MPI_Allreduce is defined as

int MPI_Allreduce(void *senddata, void *resultdata, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
We use MPI_Reduce to collect data from each process. Note also the use of the function MPI_Wtime. The final functions are the trapezoidal rule itself and the integrand.
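These are not reproduced on the slides included here; a minimal sketch, consistent with the declarations and the MPI_Reduce call above and assuming the integrand 4/(1 + x*x) on [0, 1] (which integrates to π and matches the printed result), is:

//  this function defines the integrand, here assumed to be 4/(1+x*x)
double int_function(double x)
{
  double value = 4.0/(1.0 + x*x);
  return value;
}

//  the trapezoidal rule on [a,b] with n steps for a given function func
double trapezoidal_rule(double a, double b, int n, double (*func)(double))
{
  double trapez_sum;
  double fa, fb, x, step;
  int j;
  step = (b - a)/((double) n);
  fa = (*func)(a)/2.0;
  fb = (*func)(b)/2.0;
  trapez_sum = 0.0;
  for (j = 1; j <= n-1; j++) {
    x = j*step + a;
    trapez_sum += (*func)(x);
  }
  trapez_sum = (trapez_sum + fb + fa)*step;
  return trapez_sum;
}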
Till now we have not paid much attention to speed and possible optimization possibilities inherent in the various compilers. We have compiled and linked as

mpic++ -c mycode.cpp
mpic++ -o mycode.exe mycode.o

For Fortran, replace mpic++ with mpif90. This is what we call a flat compiler option and should be used when we develop the code. It normally produces a very large and slow code when translated to machine instructions. We use this option for debugging and for establishing the correct program output, because every operation is done precisely as the user specified it. It is instructive to look up the compiler manual for further instructions:

man mpic++ > out_to_file
We have additional compiler options for optimization. These may include procedure inlining, moving constants inside loops outside the loops, identifying potential parallelism, automatic vectorization, or replacing a division with a reciprocal and a multiplication if this speeds up the code.

mpic++ -O3 -c mycode.cpp
mpic++ -O3 -o mycode.exe mycode.o

This is the recommended option. But you must check that you get the same results as previously.
It is also useful to profile your program during the development stage. You would then compile with

mpic++ -pg -O3 -c mycode.cpp
mpic++ -pg -O3 -o mycode.exe mycode.o

After you have run the code you can obtain the profiling information via

gprof mycode.exe > out_to_profile

When you have profiled your code properly, you must take out this option, as it increases your CPU expenditure. For memory tests use valgrind, see valgrind.org.
This is a rather simple and appealing method after von Neumann. Assume that we are looking at an interval x ∈ [a, b], this being the domain of the probability distribution function (PDF) p(x). Suppose also that the largest value our distribution function takes in this interval is M, that is p(x) ≤ M for x ∈ [a, b].
Then we generate a random number x from the uniform distribution on [a, b] and a corresponding number s from the uniform distribution on [0, M]. If p(x) ≥ s, we accept the new value of x; otherwise we generate two new random numbers x and s and perform the test again.
Acceptance-Rejection Method

As an example, consider the integral
\[
I = \int_0^3 \exp(x)\, dx .
\]
Obviously it is much easier to evaluate it analytically; however, the integrand could pose more difficult challenges. The aim here is simply to show how to implement the acceptance-rejection algorithm using MPI. The integral is the area below the curve f(x) = exp(x). If we uniformly fill the rectangle spanned by x ∈ [0, 3] and y ∈ [0, exp(3)], the fraction of points below the curve obtained from a uniform distribution, multiplied by the area of the rectangle, should approximate the chosen integral. It is rather easy to implement this numerically, as shown in the following code.
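That code slide is not included here; a minimal sketch of the parallel acceptance-rejection (hit-and-miss) estimate, using the standard rand() generator and an assumed number of samples per process, could look like:

#include <mpi.h>
#include <cmath>
#include <cstdlib>
#include <iostream>
using namespace std;

int main (int nargs, char* args[])
{
  int numprocs, my_rank;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  int n = 1000000;                       // samples per process (assumed)
  double xmax = 3.0, ymax = exp(3.0);
  srand (12345 + my_rank);               // different seed on each process
  int local_accepted = 0, total_accepted = 0;
  for (int i = 0; i < n; i++) {
    double x = xmax*rand()/((double) RAND_MAX);
    double s = ymax*rand()/((double) RAND_MAX);
    if (s <= exp(x)) local_accepted++;   // point lies below the curve
  }
  // sum the accepted counts from all processes on rank 0
  MPI_Reduce (&local_accepted, &total_accepted, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (my_rank == 0) {
    double integral = xmax*ymax*total_accepted/((double) n*numprocs);
    cout << "Integral = " << integral << " exact: " << exp(3.0)-1.0 << endl;
  }
  MPI_Finalize ();
  return 0;
}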
Acceptance-Rejection Method
Here it can be useful to split the program into subtasks:
A specific function which performs the Monte Carlo sampling
A function which collects all data, performs the statistical analysis and perhaps writes in parallel to file
void integrate(int number_cycles, double &Integral)
{
  double total_number_cycles;
  double variance, energy, error;
  double total_cumulative, total_cumulative_2, cumulative, cumulative_2;
  total_number_cycles = number_cycles*numprocs;
  //  Do the mc sampling
  cumulative = cumulative_2 = 0.0;
  total_cumulative = total_cumulative_2 = 0.0;
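The remaining slides with the body of this function are not included; a minimal sketch of how it could continue, assuming a Monte Carlo sampling helper mc_sampling() and that numprocs and my_rank are available as globals (all assumptions), is:

  // every process performs its own Monte Carlo sampling (hypothetical helper)
  mc_sampling(number_cycles, cumulative, cumulative_2);
  // add up the local sums of f and f*f from all processes
  MPI_Allreduce(&cumulative, &total_cumulative, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(&cumulative_2, &total_cumulative_2, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  // statistical analysis of the collected data
  energy = total_cumulative/total_number_cycles;
  variance = total_cumulative_2/total_number_cycles - energy*energy;
  error = sqrt(variance/total_number_cycles);
  if (my_rank == 0) {
    cout << "Integral = " << energy << " error = " << error << endl;
  }
  Integral = energy;
}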
Parallel Jacobi Algorithm

Different data distribution schemes:
Row-wise distribution
Column-wise distribution
Other alternatives not discussed here: cyclic shifting
Direct solvers such as Gaussian elimination and LU decomposition
Iterative solvers such as the basic iterative solvers: Jacobi, Gauss-Seidel, successive over-relaxation
Other iterative methods such as Krylov subspace methods, with the generalized minimum residual (GMRES), conjugate gradient, etc.
It is a simple method for solving
\[
A x = b,
\]
where A is a matrix and x and b are vectors. The vector x is the unknown. It is an iterative scheme where, after k + 1 iterations, we have
\[
x^{(k+1)} = D^{-1}\bigl(b - (L + U) x^{(k)}\bigr),
\]
with A = D + U + L, where D is a diagonal matrix, U an upper triangular matrix and L a lower triangular matrix.
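Written out for component i, this is the familiar update (a standard rearrangement, added here for reference):
\[
x_i^{(k+1)} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{j\neq i} a_{ij}\, x_j^{(k)}\Bigr), \qquad i = 1,\dots,n .
\]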
Row-wise distribution
Assume that the dimension of the matrix, n × n, can be divided by the number of CPUs P, with m = n/P.
Blocks of m rows of the coefficient matrix are distributed to the different CPUs.
The vector of unknowns and the right-hand side are distributed similarly.
Data to be communicated
For its m rows, each CPU already has all the columns of the matrix A.
Only part of the vector x is available on a CPU.
We therefore cannot carry out the matrix-vector multiplication directly; we need to communicate the vector x during the computations.
The vector x can be gathered on every CPU with MPI_Allgather:

int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

Here sendbuf holds the local piece of x (of length m) and recvbuf receives the whole gathered vector.
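A minimal sketch of the row-wise distributed matrix-vector product with MPI_Allgather, assuming row-major storage, m = n/P local rows, and blocks of x ordered by rank (all assumptions), is:

#include <mpi.h>

// local_A: the m x n block of rows owned by this CPU (row-major)
// local_x: the m entries of x owned by this CPU
// x:       work array of length n for the gathered full vector
// local_y: the m entries of y = A x owned by this CPU
void matvec_rowwise(double *local_A, double *local_x, double *x,
                    double *local_y, int m, int n)
{
  // gather the full vector x on every CPU (blocks ordered by rank)
  MPI_Allgather(local_x, m, MPI_DOUBLE, x, m, MPI_DOUBLE, MPI_COMM_WORLD);
  // each CPU multiplies its own rows by the full vector
  for (int i = 0; i < m; i++) {
    local_y[i] = 0.0;
    for (int j = 0; j < n; j++) {
      local_y[i] += local_A[i*n + j]*x[j];
    }
  }
}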
Another method: cyclic shift.
Shift the partial vector x upward at each step.
Do a partial matrix-vector multiplication on each CPU at each step.
After P steps (P is the number of CPUs), the overall matrix-vector multiplication is complete.
Each CPU needs only to communicate with its neighboring CPUs, which provides opportunities to overlap communication with computations; see the sketch below.
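A minimal sketch of the cyclic-shift variant, using MPI_Sendrecv_replace for the shift and the same assumed data layout as above, is:

#include <mpi.h>

// local_A: the m x n block of rows owned by this CPU (row-major), n = m*P
// work_x:  array of length m, initially holding this CPU's own block of x
// local_y: the m entries of y = A x owned by this CPU
void matvec_cyclic(double *local_A, double *work_x, double *local_y,
                   int m, int n, int P, int my_rank)
{
  for (int i = 0; i < m; i++) local_y[i] = 0.0;
  for (int step = 0; step < P; step++) {
    // after 'step' upward shifts this CPU holds block (my_rank + step) mod P of x
    int block = (my_rank + step) % P;
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < m; j++) {
        local_y[i] += local_A[i*n + block*m + j]*work_x[j];
      }
    }
    // shift the block upward: send to my_rank-1, receive from my_rank+1
    MPI_Sendrecv_replace(work_x, m, MPI_DOUBLE,
                         (my_rank + P - 1) % P, 0,
                         (my_rank + 1) % P, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
}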
Row-wise algorithm
Column-wise distribution
Blocks of m columns of the matrix A are distributed among the P different CPUs.
Blocks of m rows of the vectors x and b are distributed to the different CPUs.
Data to be communicated
Each CPU already has the coefficient matrix data for m columns and a block of m rows of the vector x. A partial A x can therefore be computed on each CPU independently. Communication is then needed to obtain the whole A x, using MPI_Allreduce; see the sketch below.
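A minimal sketch of this column-wise variant, again with assumed array layouts (local_A now holds m complete columns), is:

#include <mpi.h>

// local_A:   the n x m block of columns owned by this CPU (row-major: n rows, m columns)
// local_x:   the m entries of x that multiply these columns
// partial_y: work array of length n for the partial product
// y:         the full result y = A x, available on every CPU afterwards
void matvec_columnwise(double *local_A, double *local_x, double *partial_y,
                       double *y, int m, int n)
{
  // partial product using only the locally stored columns
  for (int i = 0; i < n; i++) {
    partial_y[i] = 0.0;
    for (int j = 0; j < m; j++) {
      partial_y[i] += local_A[i*m + j]*local_x[j];
    }
  }
  // sum the partial products from all CPUs to obtain the full A x
  MPI_Allreduce(partial_y, y, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}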