
Parallel programming using MPI

Morten Hjorth-Jensen
Department of Physics and Center of Mathematics for Applications, University of Oslo, N-0316 Oslo, Norway

NOTUR 2011

1 / 69

Target group
You have some experience in programming but have never tried to parallelize your codes.
Here I will base my examples on C/C++ and Fortran using the Message Passing Interface (MPI).
In order to get you started I will show some simple examples using numerical integration.
I will also give you some simple hints on how to run and install codes with MPI on your laptop/PC.
The programs can be found at http://www.uio.no/studier/emner/matnat/fys/FYS3150/index-eng.xml, see under the programs link for, for example, fall 2010.
See also Xing Cai's lectures, right after these.
Good text: Karniadakis and Kirby, Parallel Scientific Computing in C++ and MPI, Cambridge.
2 / 69

Strategies

Develop codes locally, run with a few processes and test your codes.
Do benchmarking, timing and so forth on local nodes, for example your laptop or PC.
You can install MPICH2 on your laptop/PC. Many UiO PCs running Linux have MPICH2 installed. Test by typing which mpd.
When you are convinced that your codes run correctly, you start your production runs on available supercomputers, in our case titan.uio.no.

3 / 69

How do I run MPI on a PC at UiO? (Linux setup here)


Most machines at the computer labs at UiO are quad-cores.
Compile with mpicxx or mpic++ or mpif90.
Set up collaboration between processes and run:

mpd --ncpus=4 &
# run code with
mpiexec -n 4 ./nameofprog

Here we declare that we will use 4 processes via the --ncpus option and via -n 4 when running. End with
mpdallexit

4 / 69

Can I do it on my own PC/laptop?

Of course: go to http://www.mcs.anl.gov/research/projects/mpich2/, follow the instructions and install it on your own PC/laptop.
If you use Ubuntu it is very simple: sudo apt-get install mpich2
For Windows, you may think of installing WUBI, see http://www.ubuntu.com/download/ubuntu/windows-installer

5 / 69

Titan

Hardware
304 dual-cpu quad-core Sun X2200 Opteron nodes (total 2432 cores), 2.2 GHz, with 8-16 GB RAM and 250-1000 GB disk on each node
3 eight-cpu quad-core Sun X4600 AMD Opteron nodes (total 96 cores), 2.5 GHz, with 128, 128 and 256 GB memory, respectively
Infiniband interconnect
Heterogeneous cluster!

6 / 69

Titan
Software
Batch system: SLURM and MAUI
Message Passing Interface (MPI): OpenMPI, Scampi, MPICH2
Compilers: GCC, Intel, Portland and Pathscale
Optimized math libraries and scientific applications
All you need may be found under /site
Available software: http://www.hpc.uio.no/index.php/Titan_software

7 / 69

Getting started
Batch systems
A batch system controls the use of the cluster resources:
Submits the job to the right resource
Monitors the job while executing
Restarts the job in case of failure
Takes care of priorities and queues to control the execution order of unrelated jobs

Sun Grid Engine


SGE is the batch system used on Titan
Jobs are executed either interactively or through job scripts
Useful commands: showq, qlogin, sbatch
http://hpc.uio.no/index.php/Titan_User_Guide
8 / 69

Getting started
Modules
Different compilers, MPI versions and applications need different sets of user environment variables.
The modules package lets you load and remove the different variable sets.
Useful commands:
List available modules: module avail
Load a module: module load <environment>
Unload a module: module unload <environment>
Currently loaded: module list

http://hpc.uio.no/index.php/Titan_User_Guide

9 / 69

Example

Interactively
# login to titan
$ ssh titan.uio.no
# ask for 4 cpus
$ qlogin --account=fys3150 --ntasks=4
# start a job setup, note the punktum (dot)!
$ source /site/bin/jobsetup
# we want to use the intel module
$ module load intel
$ module load openmpi/1.2.8.intel
$ mkdir -p fys3150/mpiexample/
$ cd fys3150/mpiexample/
# Use program6.cpp from the course pages, see chapter 4
# compile the program
$ mpic++ -O3 -o program6.x program6.cpp
# and execute it
$ mpirun ./program6.x
Trapezoidal rule = 3.14159
Time = 0.000378132 on number of processors: 4

10 / 69

The job script


job.slurm
#!/bin/sh
# Call this file job.slurm
# 4 cpus with mpi (or other communication)
#SBATCH --ntasks=4
# 10 mins of walltime
#SBATCH --time=0:10:00
# project fys3150
#SBATCH --account=fys3150
# we need 2000 MB of memory per process
#SBATCH --mem-per-cpu=2000M
# name of job
#SBATCH --job-name=program5
source /site/bin/jobsetup
# load the modules used when we compiled the program
module load intel
module load openmpi/1.2.8.intel

# start program
mpirun ./program5.x
#END OF SCRIPT

11 / 69

Example

Submitting
# login to titan
$ ssh titan.uio.no
# and submit it
$ sbatch job.slurm
$ exit

12 / 69

Example
Checking execution
# check if job is running:
$ showq -u mhjensen

ACTIVE JOBS
JOBNAME        USERNAME    STATE     PROC    REMAINING    STARTTIME
883129         mhjensen    Running      4     10:31:17    Fri Oct  2 13:59:25

1 Active Job    2692 of 4252 Processors Active (63.31%)
                 482 of  602 Nodes Active      (80.07%)

IDLE JOBS
JOBNAME        USERNAME    STATE     PROC    WCLIMIT      QUEUETIME

0 Idle Jobs

BLOCKED JOBS
JOBNAME        USERNAME    STATE     PROC    WCLIMIT      QUEUETIME

Total Jobs: 1    Active Jobs: 1    Idle Jobs: 0    Blocked Jobs: 0

13 / 69

Tips and admonitions


Tips
Titan FAQ: http://www.hpc.uio.no/index.php/FAQ
man-pages, e.g. man sbatch
Ask us

Admonitions
Remember to exit from qlogin sessions; the resource is reserved for you until you exit.
Don't run jobs on the login nodes; these are only for compiling and editing files.

14 / 69

What is Message Passing Interface (MPI)?

MPI is a library, not a language. It specifies the names, calling sequences and results of functions or subroutines to be called from C/C++ or Fortran programs, and the classes and methods that make up the MPI C++ library. The programs that users write in Fortran, C or C++ are compiled with ordinary compilers and linked with the MPI library. MPI programs should be able to run on all possible machines and with all MPI implementations without change. An MPI computation is a collection of processes communicating with messages.

15 / 69

Going Parallel with MPI


Task parallelism: the work of a global problem can be divided into a number of independent tasks, which rarely need to synchronize. Monte Carlo simulations or numerical integration are examples of this. MPI is a message-passing library where all the routines have corresponding C/C++-binding
MPI_Command_name

and Fortran-binding (routine names are in uppercase, but can also be in lower case)
MPI_COMMAND_NAME

16 / 69

MPI

MPI is a library specification for the message passing interface, proposed as a standard:
independent of hardware;
not a language or compiler specification;
not a specific implementation or product.
A message passing standard for portability and ease-of-use. Designed for high performance. Insert communication and synchronization functions where necessary.

17 / 69

The basic ideas of parallel computing

Pursuit of shorter computation time and larger simulation size gives rise to parallel computing. Multiple processors are involved to solve a global problem. The essence is to divide the entire computation evenly among collaborative processors. Divide and conquer.

18 / 69

A rough classication of hardware models

Conventional single-processor computers can be called SISD (single-instruction-single-data) machines. SIMD (single-instruction-multiple-data) machines incorporate the idea of parallel processing, using a large number of processing units to execute the same instruction on different data. Modern parallel computers are so-called MIMD (multiple-instruction-multiple-data) machines and can execute different instruction streams in parallel on different data.

19 / 69

Shared memory and distributed memory

One way of categorizing modern parallel computers is to look at the memory configuration. In shared memory systems the CPUs share the same address space. Any CPU can access any data in the global memory. In distributed memory systems each CPU has its own memory. The CPUs are connected by some network and may exchange messages.

20 / 69

Different parallel programming paradigms


Task parallelism: the work of a global problem can be divided into a number of independent tasks, which rarely need to synchronize. Monte Carlo simulation is one example. Integration is another. However this paradigm is of limited use.
Data parallelism: use of multiple threads (e.g. one thread per processor) to dissect loops over arrays etc. This paradigm requires a single memory address space. Communication and synchronization between processors are often hidden, thus easy to program. However, the user surrenders much control to a specialized compiler. Examples of data parallelism are compiler-based parallelization and OpenMP directives. See Xing Cai's lectures following this one.
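As a hedged illustration of data parallelism (not part of the original slides; the array name, problem size and compile flag are chosen only for the example), a single OpenMP directive splits a loop over an array among threads in a shared address space:

// Compile with e.g. g++ -fopenmp omp_sketch.cpp -o omp_sketch.x (GCC-style flag, assumed here)
#include <omp.h>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
  const int n = 1000000;               // illustrative problem size
  vector<double> a(n);
  // Each thread works on its own chunk of the loop; the shared address
  // space means no explicit communication is needed.
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
    a[i] = 2.0*i + 1.0;
  }
  cout << "a[n-1] = " << a[n-1] << endl;
  return 0;
}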

21 / 69

Different parallel programming paradigms

Message passing: all involved processors have an independent memory address space. The user is responsible for partitioning the data/work of a global problem and distributing the subproblems to the processors. Collaboration between processors is achieved by explicit message passing, which is used for data transfer plus synchronization. This paradigm is the most general one where the user has full control. Better parallel efficiency is usually achieved by explicit message passing. However, message-passing programming is more difficult.

22 / 69

SPMD
Although message-passing programming supports MIMD, it suffices with an SPMD (single-program-multiple-data) model, which is flexible enough for practical cases:
Same executable for all the processors.
Each processor works primarily with its assigned local data.
Progression of code is allowed to differ between synchronization points.
Possible to have a master/slave model. The standard option in Monte Carlo calculations and numerical integration.
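A hedged sketch of the master/slave idea within SPMD (not from the original slides; the tag and the slaves' placeholder computation are invented): one executable runs everywhere and the rank decides the role.

#include <mpi.h>
#include <iostream>
using namespace std;

int main(int nargs, char* args[])
{
  int numprocs, my_rank;
  MPI_Init(&nargs, &args);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  // Same program on all processes; the rank selects master or slave behaviour.
  if (my_rank == 0) {
    // master: collect one number from every slave
    for (int source = 1; source < numprocs; source++) {
      double result;
      MPI_Status status;
      MPI_Recv(&result, 1, MPI_DOUBLE, source, 100, MPI_COMM_WORLD, &status);
      cout << "Master received " << result << " from rank " << source << endl;
    }
  } else {
    // slave: work on its own local data and report back
    double local_result = 1.0*my_rank*my_rank;   // placeholder computation
    MPI_Send(&local_result, 1, MPI_DOUBLE, 0, 100, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}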

23 / 69

Todays situation of parallel computing

Distributed memory is the dominant hardware configuration. There is a large diversity in these machines, from MPP (massively parallel processing) systems to clusters of off-the-shelf PCs, which are very cost-effective. Message passing is a mature programming paradigm and widely accepted. It often provides an efficient match to the hardware. It is primarily used for distributed memory systems, but can also be used on shared memory systems. In these lectures we consider only message passing for writing parallel programs.

24 / 69

Overhead present in parallel computing

Uneven load balance: not all the processors can perform useful work at all times.
Overhead of synchronization.
Overhead of communication.
Extra computation due to parallelization.
Due to the above overhead, and because certain parts of a sequential algorithm cannot be parallelized, we may not achieve optimal parallelization.

25 / 69

Parallelizing a sequential algorithm

Identify the part(s) of a sequential algorithm that can be executed in parallel. This is the difficult part.
Distribute the global work and data among P processors.

26 / 69

Bindings to MPI routines

MPI is a message-passing library where all the routines have corresponding C/C++-binding
MPI_Command_name

and Fortran-binding (routine names are in uppercase, but can also be in lower case)
MPI_COMMAND_NAME

The discussion in these slides focuses on the C++ binding.

27 / 69

Communicator

A group of MPI processes with a name (context).
Any process is identified by its rank. The rank is only meaningful within a particular communicator.
By default the communicator MPI_COMM_WORLD contains all the MPI processes.
Mechanism to identify a subset of processes.
Promotes modular design of parallel libraries.
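A hedged sketch (not from the original slides) of how a subset communicator can be created: MPI_Comm_split groups processes by a colour, here simply even/odd rank, chosen only for illustration.

#include <mpi.h>
#include <iostream>
using namespace std;

int main(int nargs, char* args[])
{
  int world_rank, world_size;
  MPI_Init(&nargs, &args);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  // Split MPI_COMM_WORLD into two sub-communicators: even and odd ranks.
  int colour = world_rank % 2;
  MPI_Comm sub_comm;
  MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &sub_comm);
  int sub_rank, sub_size;
  MPI_Comm_rank(sub_comm, &sub_rank);
  MPI_Comm_size(sub_comm, &sub_size);
  cout << "World rank " << world_rank << " has rank " << sub_rank
       << " out of " << sub_size << " in its sub-communicator" << endl;
  MPI_Comm_free(&sub_comm);
  MPI_Finalize();
  return 0;
}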

28 / 69

Some of the most important MPI functions


MPI_Init - initiate an MPI computation
MPI_Finalize - terminate the MPI computation and clean up
MPI_Comm_size - how many processes participate in a given MPI communicator?
MPI_Comm_rank - which one am I? (A number between 0 and size-1.)
MPI_Send - send a message to a particular process within an MPI communicator
MPI_Recv - receive a message from a particular process within an MPI communicator
MPI_Reduce or MPI_Allreduce - combine values (for example a sum) from all processes

29 / 69

The rst MPI C/C++ program


Let every process write Hello world (oh not this program again!!) on the standard output.
using namespace std;
#include <mpi.h>
#include <iostream>

int main (int nargs, char* args[])
{
  int numprocs, my_rank;
  //   MPI initializations
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  cout << "Hello world, I have rank " << my_rank << " out of "
       << numprocs << endl;
  //  End MPI
  MPI_Finalize ();
  return 0;
}
30 / 69

The Fortran program

PROGRAM hello
INCLUDE "mpif.h"
INTEGER :: size, my_rank, ierr
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
WRITE(*,*) "Hello world, I've rank ", my_rank, " out of ", size
CALL MPI_FINALIZE(ierr)
END PROGRAM hello

31 / 69

Note 1

The output to screen is not ordered since all processes are trying to write to screen simultaneously. It is then the operating system which opts for an ordering. If we wish to have an organized output, starting from the first process, we may rewrite our program as in the next example.

32 / 69

Ordered output with MPI Barrier

int main (int nargs, char* args[])
{
  int numprocs, my_rank, i;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  for (i = 0; i < numprocs; i++) {
    MPI_Barrier (MPI_COMM_WORLD);
    if (i == my_rank) {
      cout << "Hello world, I have rank " << my_rank <<
              " out of " << numprocs << endl;
    }
  }
  MPI_Finalize ();
  return 0;
}

33 / 69

Note 2
Here we have used the MPI_Barrier function to ensure that every process has completed its set of instructions in a particular order. A barrier is a special collective operation that does not allow the processes to continue until all processes in the communicator (here MPI_COMM_WORLD) have called MPI_Barrier. The barriers make sure that all processes have reached the same point in the code. Many of the collective operations, like MPI_ALLREDUCE to be discussed later, have the same property; viz. no process can exit the operation until all processes have started. However, this is slightly more time-consuming since the processes synchronize between themselves as many times as there are processes. In the next Hello world example we use the send and receive functions in order to have a synchronized action.

34 / 69

Ordered output with MPI Recv and MPI Send


.....
int numprocs, my_rank, flag;
MPI_Status status;
MPI_Init (&nargs, &args);
MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
if (my_rank > 0)
  MPI_Recv (&flag, 1, MPI_INT, my_rank-1, 100, MPI_COMM_WORLD, &status);
cout << "Hello world, I have rank " << my_rank << " out of "
     << numprocs << endl;
if (my_rank < numprocs-1)
  MPI_Send (&my_rank, 1, MPI_INT, my_rank+1, 100, MPI_COMM_WORLD);
MPI_Finalize ();

35 / 69

Note 3
The basic sending of messages is given by the function MPI_SEND, which in C/C++ is defined as

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

This single command allows the passing of any kind of variable, even a large array, to any group of tasks. The variable buf is the variable we wish to send while count is the number of variables we are passing. If we are passing only a single value, this should be 1. If we transfer an array, it is the overall size of the array. For example, if we want to send a 10 by 10 array, count would be 10 × 10 = 100 since we are actually passing 100 values.
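As a hedged illustration (the array name, fill values and tag are invented for the example), sending a 10 by 10 array in one call with count = 100 could look like this:

#include <mpi.h>

int main(int nargs, char* args[])
{
  int numprocs, my_rank;
  MPI_Init(&nargs, &args);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  double a[10][10];                      // 10 x 10 array, 100 values in total
  if (my_rank == 0) {
    for (int i = 0; i < 10; i++)
      for (int j = 0; j < 10; j++)
        a[i][j] = i + 0.01*j;            // fill with some values
    // count = 100 since all 100 doubles go in one message
    if (numprocs > 1)
      MPI_Send(&a[0][0], 100, MPI_DOUBLE, 1, 200, MPI_COMM_WORLD);
  } else if (my_rank == 1) {
    MPI_Status status;
    MPI_Recv(&a[0][0], 100, MPI_DOUBLE, 0, 200, MPI_COMM_WORLD, &status);
  }
  MPI_Finalize();
  return 0;
}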

36 / 69

Note 4
Once you have sent a message, you must receive it on another task. The function MPI_RECV is similar to the send call.

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

The arguments that are different from those in MPI_SEND are buf, which is the name of the variable where you will be storing the received data, and source, which replaces the destination in the send command. This is the return ID of the sender. Finally, we have used MPI_Status status, where one can check if the receive was completed. The output of this code is the same as the previous example, but now process 0 sends a message to process 1, which forwards it further to process 2, and so forth.
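A hedged add-on (not in the original slides; buffer sizes, values and the tag are arbitrary): after MPI_Recv one can inspect the MPI_Status object, for example to find out who actually sent the message and how many elements arrived.

#include <mpi.h>
#include <iostream>
using namespace std;

int main(int nargs, char* args[])
{
  int numprocs, my_rank;
  MPI_Init(&nargs, &args);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  if (my_rank == 0 && numprocs > 1) {
    double buf[10];
    MPI_Status status;
    // Accept a message from any sender with any tag
    MPI_Recv(buf, 10, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    int count;
    MPI_Get_count(&status, MPI_DOUBLE, &count);   // how many doubles arrived
    cout << "Got " << count << " doubles from rank " << status.MPI_SOURCE
         << " with tag " << status.MPI_TAG << endl;
  } else if (my_rank == 1) {
    double data[3] = {1.0, 2.0, 3.0};
    MPI_Send(data, 3, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}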
37 / 69

Integrating
Examples
The code example computes π using the trapezoidal rule. The trapezoidal rule reads
$$I=\int_a^b f(x)\,dx \approx h\left(f(a)/2 + f(a+h) + f(a+2h) + \cdots + f(b-h) + f(b)/2\right),$$
with $h=(b-a)/n$ the step length.

38 / 69

Dissection of trapezoidal rule with MPI_Reduce


//    Trapezoidal rule and numerical integration using MPI, example program6.cpp
using namespace std;
#include <mpi.h>
#include <iostream>

//     Here we define various functions called by the main program

double int_function(double );
double trapezoidal_rule(double , double , int , double (*)(double));

//   Main function begins here
int main (int nargs, char* args[])
{
  int n, local_n, numprocs, my_rank;
  double a, b, h, local_a, local_b, total_sum, local_sum;
  double time_start, time_end, total_time;

39 / 69

Dissection of trapezoidal rule with MPI_Reduce


  //  MPI initializations
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  time_start = MPI_Wtime();
  //  Fixed values for a, b and n
  a = 0.0; b = 1.0;  n = 1000;
  h = (b-a)/n;           // h is the same for all processes
  local_n = n/numprocs;  // make sure n > numprocs, else integer division gives zero
  //  Length of each process' interval of
  //  integration = local_n*h.
  local_a = a + my_rank*local_n*h;
  local_b = local_a + local_n*h;

40 / 69

Dissection of trapezoidal rule with MPI_Reduce


  total_sum = 0.0;
  local_sum = trapezoidal_rule(local_a, local_b, local_n, &int_function);
  MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  time_end = MPI_Wtime();
  total_time = time_end - time_start;
  if ( my_rank == 0) {
    cout << "Trapezoidal rule = " << total_sum << endl;
    cout << "Time = " << total_time
         << " on number of processors: " << numprocs << endl;
  }
  //  End MPI
  MPI_Finalize ();
  return 0;
}  // end of main program
41 / 69

MPI_Reduce
Here we have used

MPI_Reduce(void *senddata, void *resultdata, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
The two variables senddata and resultdata are obvious, besides the fact that one sends the address of the variable or of the first element of an array. If they are arrays they need to have the same size. The variable count represents the total dimensionality, 1 in the case of just one variable, while MPI_Datatype defines the type of variable which is sent and received. The new feature is MPI_Op. It defines the type of operation we want to do. In our case, since we are summing the contributions from every process, we define MPI_Op = MPI_SUM. If we have an array or matrix we can search for the largest or smallest element by sending either MPI_MAX or MPI_MIN. If we want the location as well (which array element) we simply transfer MPI_MAXLOC or MPI_MINLOC. If we want the product we write MPI_PROD. MPI_Allreduce is defined as

MPI_Allreduce(void *senddata, void *resultdata, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
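A hedged sketch (not from the original slides) of MPI_Allreduce with MPI_MAXLOC: to get both the largest value and the rank that owns it, one sends a value/index pair using the predefined MPI_DOUBLE_INT type; the local value below is invented for illustration.

#include <mpi.h>
#include <iostream>
using namespace std;

int main(int nargs, char* args[])
{
  int numprocs, my_rank;
  MPI_Init(&nargs, &args);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  // Pair of (value, owner rank) as required by MPI_MAXLOC with MPI_DOUBLE_INT
  struct { double value; int rank; } local, global;
  local.value = 1.0/(1.0 + my_rank);   // some made-up local quantity
  local.rank  = my_rank;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
  if (my_rank == 0)
    cout << "Largest value " << global.value
         << " found on rank " << global.rank << endl;
  MPI_Finalize();
  return 0;
}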
42 / 69

Dissection of trapezoidal rule with MPI_Reduce

We use MPI_Reduce to collect data from each process. Note also the use of the function MPI_Wtime. The final functions are

//  this function defines the function to integrate
double int_function(double x)
{
  double value = 4./(1.+x*x);
  return value;
}  // end of function to evaluate

43 / 69

Dissection of trapezoidal rule with MPI_Reduce


//  this function defines the trapezoidal rule
double trapezoidal_rule(double a, double b, int n, double (*func)(double))
{
  double trapez_sum;
  double fa, fb, x, step;
  int j;
  step = (b-a)/((double) n);
  fa = (*func)(a)/2.;
  fb = (*func)(b)/2.;
  trapez_sum = 0.;
  for (j = 1; j <= n-1; j++) {
    x = j*step + a;
    trapez_sum += (*func)(x);
  }
  trapez_sum = (trapez_sum + fb + fa)*step;
  return trapez_sum;
}  // end trapezoidal_rule
44 / 69

Optimization and proling

Till now we have not paid much attention to speed and the optimization possibilities inherent in the various compilers. We have compiled and linked as

mpic++ -c mycode.cpp
mpic++ -o mycode.exe mycode.o

For Fortran replace mpic++ with mpif90. This is what we call a flat compiler option and should be used when we develop the code. It normally produces a very large and slow code when translated to machine instructions. We use this option for debugging and for establishing the correct program output, because every operation is done precisely as the user specified it. It is instructive to look up the compiler manual for further instructions

man mpic++ > out_to_file

45 / 69

Optimization and proling

We have additional compiler options for optimization. These may include procedure inlining where performance may be improved, moving constants inside loops outside the loop, identifying potential parallelism, automatic vectorization, or replacing a division with a reciprocal and a multiplication if this speeds up the code.

mpic++ -O3 -c mycode.cpp
mpic++ -O3 -o mycode.exe mycode.o

This is the recommended option. But you must check that you get the same results as previously.

46 / 69

Optimization and proling

It is also useful to profile your program during the development stage. You would then compile with

mpic++ -pg -O3 -c mycode.cpp
mpic++ -pg -O3 -o mycode.exe mycode.o

After you have run the code you can obtain the profiling information via

gprof mycode.exe > out_to_profile

When you have profiled your code properly, you must take out this option as it increases your CPU expenditure. For memory tests use valgrind, see valgrind.org.

47 / 69

Optimization and proling


Other hints:
Avoid if tests or calls to functions inside loops, if possible.
Avoid multiplication with constants inside loops if possible.

Bad code:
for i = 1:n
  a(i) = b(i) + c*d
  e = g(k)
end

Better code:
temp = c*d
for i = 1:n
  a(i) = b(i) + temp
end
e = g(k)
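The same hint written out in C++, a hedged sketch with invented array and function names: the constant product and the loop-invariant assignment are hoisted out of the loop.

#include <vector>

void bad_version(std::vector<double>& a, const std::vector<double>& b,
                 const std::vector<double>& g, double c, double d,
                 double& e, int k)
{
  // Bad: c*d is recomputed and e is reassigned in every iteration.
  for (std::size_t i = 0; i < a.size(); i++) {
    a[i] = b[i] + c*d;
    e = g[k];
  }
}

void better_version(std::vector<double>& a, const std::vector<double>& b,
                    const std::vector<double>& g, double c, double d,
                    double& e, int k)
{
  // Better: hoist the loop-invariant work out of the loop.
  const double temp = c*d;
  for (std::size_t i = 0; i < a.size(); i++) {
    a[i] = b[i] + temp;
  }
  e = g[k];
}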

48 / 69

Monte Carlo integration: Acceptance-Rejection Method

This is a rather simple and appealing method after von Neumann. Assume that we are looking at an interval $x\in [a,b]$, this being the domain of the probability distribution function (PDF) $p(x)$. Suppose also that the largest value our distribution function takes in this interval is $M$, that is $p(x)\leq M$ for all $x\in [a,b]$.

Then we generate a random number $x$ from the uniform distribution on $[a,b]$ and a corresponding number $s$ from the uniform distribution on $[0,M]$. If $p(x)\geq s$, we accept the new value of $x$; otherwise we generate two new random numbers $x$ and $s$ and perform the test again.
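A minimal serial sketch of the test above (not from the slides; the PDF, interval and random-number generator are chosen only for illustration):

#include <cstdlib>
#include <cmath>
#include <iostream>
using namespace std;

// Example (unnormalized) PDF on [a,b]; any p(x) <= M on [a,b] would do.
double p(double x) { return exp(-x*x); }

int main()
{
  const double a = -2.0, b = 2.0, M = 1.0;     // p(x) <= 1 on this interval
  double x, s;
  // Keep drawing (x, s) until s <= p(x); x is then accepted.
  do {
    x = a + (b - a)*rand()/(RAND_MAX + 1.0);   // uniform in [a,b)
    s = M*rand()/(RAND_MAX + 1.0);             // uniform in [0,M)
  } while (s > p(x));
  cout << "accepted x = " << x << endl;
  return 0;
}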

49 / 69

Acceptance-Rejection Method

As an example, consider the evaluation of the integral
$$I=\int_0^3 \exp{(x)}\,dx.$$
Obviously it is much easier to derive it analytically; however, the integrand could pose more difficult challenges. The aim here is simply to show how to implement the acceptance-rejection algorithm using MPI. The integral is the area below the curve $f(x)=\exp{(x)}$. If we uniformly fill the rectangle spanned by $x\in [0,3]$ and $y\in [0,\exp{(3)}]$, the fraction below the curve obtained from a uniform distribution, multiplied by the area of the rectangle, should approximate the chosen integral. It is rather easy to implement this numerically, as shown in the following code.

50 / 69

Simple Plot of the Accept-Reject Method

51 / 69

algo: Acceptance-Rejection Method


//   Loop over Monte Carlo trials n
integral = 0.;
for (int i = 1; i <= n; i++) {
  //  Finds a random value for x in the interval [0,3]
  x = 3*ran0(&idum);
  //  Finds a y-value between [0, exp(3)]
  y = exp(3.0)*ran0(&idum);
  //  if the value of y at exp(x) is below the curve, we accept
  if (y < exp(x)) s = s + 1.0;
  //  The integral is the area enclosed below the line f(x)=exp(x)
}
//  Then we multiply with the area of the rectangle and
//  divide by the number of cycles
Integral = 3.*exp(3.)*s/n;

52 / 69

Acceptance-Rejection Method

Here it can be useful to split the program into subtasks:
A specific function which performs the Monte Carlo sampling.
A function which collects all data, performs the statistical analysis and perhaps writes in parallel to file.

53 / 69

algo: Acceptance-Rejection Method


int main (int argc, char* argv[])
{
  //  declarations
  ....
  //   MPI initializations
  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  double time_start = MPI_Wtime();
  if (my_rank == 0 && argc <= 1) {
    cout << "Bad Usage: " << argv[0] <<
      " read also output file on same line" << endl;
  }
  if (my_rank == 0 && argc > 1) {
    outfilename = argv[1];
    ofile.open(outfilename);
  }
54 / 69

algo: Acceptance-Rejection Method


  //  Perform the integration
  integrate(MC_samples, integral);
  double time_end = MPI_Wtime();
  double total_time = time_end - time_start;
  if ( my_rank == 0) {
    cout << "Time = " << total_time
         << " on number of processors: " << numprocs << endl;
    ofile << setiosflags(ios::showpoint | ios::uppercase);
    ofile << setw(15) << setprecision(8) << integral << endl;
    ofile.close();  // close output file
  }
  //  End MPI
  MPI_Finalize ();
  return 0;
}  // end of main function
55 / 69

algo: Acceptance-Rejection Method

void integrate(int number_cycles, double &Integral)
{
  double total_number_cycles;
  double variance, energy, error;
  double total_cumulative, total_cumulative_2, cumulative, cumulative_2;
  total_number_cycles = number_cycles*numprocs;
  //  Do the mc sampling
  cumulative = cumulative_2 = 0.0;
  total_cumulative = total_cumulative_2 = 0.0;

56 / 69

algo: Acceptance-Rejection Method


  mc_sampling(number_cycles, cumulative, cumulative_2);
  //  Collect data into total averages using MPI_Allreduce
  MPI_Allreduce(&cumulative, &total_cumulative, 1,
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(&cumulative_2, &total_cumulative_2, 1,
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  Integral = total_cumulative/numprocs;
  variance = total_cumulative_2/numprocs - Integral*Integral;
  error = sqrt(variance/(total_number_cycles - 1.0));
}  // end of function integrate

57 / 69

Matrix handling, Jacobis method

Parallel Jacobi Algorithm
Different data distribution schemes:
Row-wise distribution
Column-wise distribution
Other alternatives not discussed here: cyclic shifting

58 / 69

Matrix handling, Jacobis method

Direct solvers such as Gaussian elimination and LU decomposition.
Basic iterative solvers such as Jacobi, Gauss-Seidel and successive over-relaxation.
Other iterative methods such as Krylov subspace methods, with the generalized minimum residual (GMRES), conjugate gradient, etc.

59 / 69

Matrix handling, Jacobis method

It is a simple method for solving
$$\hat{A}\mathbf{x} = \mathbf{b},$$
where $\hat{A}$ is a matrix and $\mathbf{x}$ and $\mathbf{b}$ are vectors. The vector $\mathbf{x}$ is the unknown. It is an iterative scheme where, after $k+1$ iterations, we have
$$\mathbf{x}^{(k+1)} = \hat{D}^{-1}\left(\mathbf{b} - (\hat{L}+\hat{U})\mathbf{x}^{(k)}\right),$$
with $\hat{A} = \hat{D} + \hat{U} + \hat{L}$, where $\hat{D}$ is a diagonal matrix, $\hat{U}$ an upper triangular matrix and $\hat{L}$ a lower triangular matrix.
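A hedged serial sketch of one Jacobi sweep (not from the slides; matrix storage, names and the stopping rule are left to the caller):

#include <vector>

// One serial Jacobi update: x_new = D^{-1} (b - (L+U) x_old).
// A is a dense n x n matrix stored row-major in a flat vector.
void jacobi_sweep(const std::vector<double>& A, const std::vector<double>& b,
                  const std::vector<double>& x_old, std::vector<double>& x_new)
{
  const std::size_t n = b.size();
  for (std::size_t i = 0; i < n; i++) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; j++) {
      if (j != i) sum += A[i*n + j]*x_old[j];   // the (L+U) part
    }
    x_new[i] = (b[i] - sum)/A[i*n + i];          // divide by the diagonal D
  }
}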

60 / 69

Matrix handling, Jacobis method


Shared memory or distributed memory?
Shared-memory parallelization is very straightforward.
Here we consider a distributed memory machine using MPI.
Questions to answer in the parallelization:
Data distribution (data locality):
How to distribute the coefficient matrix among CPUs?
How to distribute the vector of unknowns?
How to distribute the RHS?
Communication: what data needs to be communicated?
We want to:
Achieve data locality
Minimize the number of communications
Overlap communications with computations
Load balance

61 / 69

Row-wise distribution

Assume the dimension of the matrix, n × n, can be divided by the number of CPUs P, so that m = n/P.
Blocks of m rows of the coefficient matrix are distributed to the different CPUs.
The vector of unknowns and the RHS are distributed similarly.

62 / 69

Data to be communicated

Already have all columns of the matrix A on each CPU;
Only part of the vector x is available on a CPU;
Cannot carry out the matrix-vector multiplication directly;
Need to communicate the vector x in the computations.

63 / 69

How to Communicate Vector x?


Gather the partial vector x on each CPU to form the whole vector;
Then the matrix-vector multiplications on the different CPUs proceed independently.
Need the MPI_Allgather() function call; all localdata are collected in olddata.
Simple to implement, but:
A lot of communication
Does not scale well for a large number of processors.

MPI_Allgather(void *localdata, int sendcount, MPI_Datatype sendtype, void *olddata, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
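A hedged sketch of the row-wise matrix-vector product using MPI_Allgather (names and sizes are invented; it assumes n is divisible by the number of processes):

#include <mpi.h>
#include <vector>

// y_local = A_local * x, where A_local holds m = n/P rows of the n x n matrix.
// Every process first gathers the full vector x from the distributed pieces.
void rowwise_matvec(const std::vector<double>& A_local,   // m*n entries, row-major
                    const std::vector<double>& x_local,   // m entries
                    std::vector<double>& y_local,         // m entries
                    int n, MPI_Comm comm)
{
  int numprocs;
  MPI_Comm_size(comm, &numprocs);
  const int m = n/numprocs;
  std::vector<double> x_full(n);
  // Collect all local pieces of x into the full vector on every process.
  MPI_Allgather(const_cast<double*>(x_local.data()), m, MPI_DOUBLE,
                x_full.data(), m, MPI_DOUBLE, comm);
  // Now each process can multiply its own rows independently.
  for (int i = 0; i < m; i++) {
    double sum = 0.0;
    for (int j = 0; j < n; j++) sum += A_local[i*n + j]*x_full[j];
    y_local[i] = sum;
  }
}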

64 / 69

How to Communicate Vector x?

Another method: cyclic shift.
Shift the partial vector x upward at each step;
Do a partial matrix-vector multiplication on each CPU at each step;
After P steps (P is the number of CPUs), the overall matrix-vector multiplication is complete.
Each CPU needs only to communicate with its neighboring CPUs.
Provides opportunities to overlap communication with computations.
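A hedged sketch of one cyclic-shift step with MPI_Sendrecv_replace (not from the slides; the block size and tag are invented): each process sends its block of x to the neighbour above and receives the block from the neighbour below, in place.

#include <mpi.h>
#include <vector>

// Shift the local block of x "upward" by one process (rank r sends to r-1,
// receives from r+1), with wrap-around, using a single in-place call.
void cyclic_shift(std::vector<double>& x_block, MPI_Comm comm)
{
  int my_rank, numprocs;
  MPI_Comm_rank(comm, &my_rank);
  MPI_Comm_size(comm, &numprocs);
  const int up   = (my_rank - 1 + numprocs) % numprocs;  // destination
  const int down = (my_rank + 1) % numprocs;             // source
  MPI_Status status;
  MPI_Sendrecv_replace(x_block.data(), (int)x_block.size(), MPI_DOUBLE,
                       up, 300, down, 300, comm, &status);
}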

65 / 69

Row-wise algo

66 / 69

Overlap Communications with Computations


Communications:
Each CPU needs to send its own partial vector x to the upper neighboring CPU;
Each CPU needs to receive data from the lower neighboring CPU.
Overlap communications with computations; each CPU does the following:
Post non-blocking requests to send data to the upper neighbor and to receive data from the lower neighbor; this returns immediately.
Do the partial computation with the data currently available;
Check the non-blocking communication status; wait if necessary;
Repeat the above steps.
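A hedged sketch of this overlap pattern with MPI_Isend/MPI_Irecv (buffer names, sizes and tags are invented for the example):

#include <mpi.h>
#include <vector>

// One step of the overlapped cyclic shift: start the communication,
// compute on the data we already have, then wait before using the new block.
void overlapped_step(const std::vector<double>& send_block,
                     std::vector<double>& recv_block, MPI_Comm comm)
{
  int my_rank, numprocs;
  MPI_Comm_rank(comm, &my_rank);
  MPI_Comm_size(comm, &numprocs);
  const int up   = (my_rank - 1 + numprocs) % numprocs;
  const int down = (my_rank + 1) % numprocs;
  MPI_Request requests[2];
  // Post the non-blocking send and receive; both return immediately.
  MPI_Isend(const_cast<double*>(send_block.data()), (int)send_block.size(),
            MPI_DOUBLE, up, 400, comm, &requests[0]);
  MPI_Irecv(recv_block.data(), (int)recv_block.size(),
            MPI_DOUBLE, down, 400, comm, &requests[1]);

  // ... do the partial matrix-vector multiplication with the current data ...

  // Make sure the communication has finished before touching recv_block.
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
}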
67 / 69

Column-wise distribution

Blocks of m columns of the matrix A are distributed among the different P CPUs.
Blocks of m rows of the vectors x and b are distributed to the different CPUs.

68 / 69

Data to be communicated

Already have the coefficient matrix data for m columns and a block of m rows of the vector x.
A partial A x can be computed on each CPU independently.
Need communication to get the whole A x, using MPI_Allreduce.

69 / 69
