
Advanced MPI Use

Victor Eijkhout eijkhout@tacc.utexas.edu


Uppsala MPI course

Eijkhout: MPI course 1


Table of Contents
Atomic operations
Shared memory
Advanced collectives
Process topologies
Fortran bindings
Big data communication
Partitioned communication
Sessions model
Other MPI-4 material
MPL: a C++ interface to MPI
Basics
Summary
Appendix: exercises

Eijkhout: MPI course 2


Materials

Textbooks and repositories:


https://theartofhpc.com

Eijkhout: MPI course 3


Part I

Atomic operations

Eijkhout: MPI course 4


2. Justification

MPI-1/2 lacked tools for race condition-free one-sided communication.


These have been added in MPI-3.

Eijkhout: MPI course 5


3. Emulating shared memory with one-sided
communication

One process stores a table of work descriptors, and a ‘stack pointer’ stating how many there are.
Each process reads the pointer, reads the corresponding descriptor, and decrements the pointer.
A process that has read a descriptor then executes the corresponding task.
Non-collective behavior: processes only take a descriptor when they are available.

Eijkhout: MPI course 6


4. Simplified model

One process has a counter, which models the shared memory;
Each process, if available, reads the counter, and decrements it.
No actual work: a random decision determines whether a process is available.

Eijkhout: MPI course 7


5. Shared memory problems: what is a race condition?
Race condition: outward behavior depends on timing/synchronization of low-level events.
In shared memory programming, race conditions are associated with shared data.

Example:

Init: I=0
process 1: I=I+2
process 2: I=I+3

Scenario 1:               Scenario 2:               Scenario 3:
I=0                       I=0                       I=0
p1 read I=0  p2 read I=0  p1 read I=0  p2 read I=0  p1 read I=0
p1 local I=2 p2 local I=3 p1 local I=2 p2 local I=3 p1 local I=2
p1 write I=2              p2 write I=3              p1 write I=2
p2 write I=3              p1 write I=2              p2 read I=2
                                                    p2 local I=5
                                                    p2 write I=5
result: I=3               result: I=2               result: I=5

(In MPI, the read/write would be MPI_Get / MPI_Put calls)

Eijkhout: MPI course 8


6. Case study in shared memory: 1, wrong

// countdownput.c
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
int decrement = -1;
counter_value += decrement;
MPI_Put
( &counter_value, 1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);

Eijkhout: MPI course 9


7. Discussion

The multiple MPI_Put calls conflict.


Code is correct if in each iteration there is only one writer.
Question: In that case, can we take out the middle fence?
Question: what is wrong with
MPI_Win_fence(0,the_window);
if (i_am_available) {
MPI_Get( &counter_value, ... )
MPI_Win_fence(0,the_window);
MPI_Put( ... )
}
MPI_Win_fence(0,the_window);

Eijkhout: MPI course 10


8. Case study in shared memory: 2, hm

// countdownacc.c
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
int decrement = -1;
MPI_Accumulate
( &decrement, 1,MPI_INT,
counter_process,0,1,MPI_INT,
MPI_SUM,
the_window);
}
MPI_Win_fence(0,the_window);

Eijkhout: MPI course 11


9. Discussion: need for atomics

MPI_Accumulate is atomic, so no conflicting writes.


What is the problem?
Answer: Processes are not reading unique counter_value values.
Conclusion: Read and update need to come together:
read unique value and immediately update.

Atomic ‘get-and-set-with-no-one-coming-in-between’:
MPI_Fetch_and_op / MPI_Get_accumulate.
Former is simple version: scalar only.
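
For comparison, a minimal sketch (not in the slides) of the same atomic update done with MPI_Get_accumulate, inside the same fenced epoch and with the same variables as the case study:

int decrement = -1, counter_value;
MPI_Get_accumulate
  ( /* data to combine into the target: */   &decrement,1,MPI_INT,
    /* old target value is returned here: */ &counter_value,1,MPI_INT,
    /* target rank, disp, count, type: */    counter_process,0,1,MPI_INT,
    MPI_SUM, the_window);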

Eijkhout: MPI course 12


MPI_Fetch_and_op

Param name      Explanation                          C type         F type

MPI_Fetch_and_op (
  origin_addr   initial address of buffer            const void*    TYPE(*), DIMENSION(..)
  result_addr   initial address of result buffer     void*          TYPE(*), DIMENSION(..)
  datatype      datatype of the entry in origin,     MPI_Datatype   TYPE(MPI_Datatype)
                result, and target buffers
  target_rank   rank of target                       int            INTEGER
  target_disp   displacement from start of window    MPI_Aint       INTEGER(KIND=MPI_ADDRESS_KIND)
                to beginning of target buffer
  op            reduce operation                     MPI_Op         TYPE(MPI_Op)
  win           window object                        MPI_Win        TYPE(MPI_Win)
)

Eijkhout: MPI course 13


10. Case study in shared memory: 3, good
MPI_Win_fence(0,the_window);
int counter_value;
if (i_am_available) {
  int decrement = -1;
  total_decrement++;
  MPI_Fetch_and_op
    ( /* operate with data from origin: */ &decrement,
      /* retrieve data from target: */ &counter_value,
      MPI_INT, counter_process, 0, MPI_SUM,
      the_window);
}
MPI_Win_fence(0,the_window);
if (i_am_available) {
  my_counter_values[n_my_counter_values++] = counter_value;
}

Eijkhout: MPI course 14


11. Allowable operators. (Hint!)
MPI type      meaning                 applies to
MPI_MAX       maximum                 integer, floating point
MPI_MIN       minimum
MPI_SUM       sum                     integer, floating point, complex, multilanguage types
MPI_REPLACE   overwrite
MPI_NO_OP     no change
MPI_PROD      product
MPI_LAND      logical and             C integer, logical
MPI_LOR       logical or
MPI_LXOR      logical xor
MPI_BAND      bitwise and             integer, byte, multilanguage types
MPI_BOR       bitwise or
MPI_BXOR      bitwise xor
MPI_MAXLOC    max value and location  MPI_DOUBLE_INT and such
MPI_MINLOC    min value and location

No user-defined operators.
Eijkhout: MPI course 15
12. Problem

We are using fences, which are collective.


What if a process is still operating on its local work?

Better (but more tricky) solution:


use passive target synchronization and locks.

Eijkhout: MPI course 16


13. Passive target epoch

if (rank == 0) {
MPI_Win_lock (MPI_LOCK_EXCLUSIVE, 1, 0, win);
MPI_Put (outbuf, n, MPI_INT, 1, 0, n, MPI_INT, win);
MPI_Win_unlock (1, win);
}

No action on the target required!

Eijkhout: MPI course 17


Exercise 1 (lockfetch)

Investigate atomic updates using passive target synchronization. Use


MPI_Win_lock with an exclusive lock, which means that each process only
acquires the lock when it absolutely has to.

All processes but one update a window:


int one=1;
MPI_Fetch_and_op(&one, &readout,
MPI_INT, repo, zero_disp, MPI_SUM,
the_win);

while the remaining process spins until the others have performed
their update.

Use an atomic operation for the latter process to read out the shared value.
Can you replace the exclusive lock with a shared one?

Eijkhout: MPI course 18


Solution
Update shared window:
int assert = 0; // MPI_MODE_NOCHECK;
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, repo, assert, the_window);
int one=1;
MPI_Fetch_and_op(&one, &readout, MPI_INT, repo, zero_disp, MPI_SUM, the_window);
MPI_Win_unlock(repo,the_window);

Read-out of counter value:


// lockfetch.c
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, repo, assert, the_window);
int update=0;
MPI_Fetch_and_op
(&update, &readout ,
MPI_INT, repo,zero_disp, MPI_NO_OP, the_window);
MPI_Win_unlock(repo,the_window);

Eijkhout: MPI course 19


Exercise 2 (lockfetchshared)

As exercise 1, but now use a shared lock: all processes acquire the lock
simultaneously and keep it as long as is needed.

The problem here is that coherence between window buffers and local
variables is now not forced by a fence or releasing a lock. Use
MPI_Win_flush_local to force coherence of a window (on another process)
and the local variable from MPI_Fetch_and_op.

Eijkhout: MPI course 20


Solution
// lockfetchshared.c
MPI_Win_lock
  (MPI_LOCK_SHARED,
   repo, assert, the_window);
if (procno == supervisor) {
  do {
    /*
     * Exercise: read out the window's content using an atomic operation
     */
    int update=0;
    MPI_Fetch_and_op
      (&update, &readout,
       MPI_INT, repo, zero_disp,
       MPI_NO_OP, the_window);
    MPI_Win_flush_local
      (repo,the_window);
    printf("Supervisor readout: %d\n",readout);
  } while( readout<nprocs-1 );
  printf("Supervisor is done!\n");
} else {
  int one=1;
  MPI_Fetch_and_op
    (&one, &readout,
     MPI_INT, repo, zero_disp,
     MPI_SUM, the_window);
  MPI_Win_flush_local
    (repo,the_window);
  printf("[%d] adding 1 to %d\n",procno,readout);
}
MPI_Win_unlock(repo,the_window);

Eijkhout: MPI course 21


Part II

Shared memory

Eijkhout: MPI course 22


14. Shared memory myths

Myth:
MPI processes use network calls, whereas OpenMP threads access
memory directly, therefore OpenMP is more efficient for shared
memory.
Truth:
MPI implementations use copy operations when possible, whereas
OpenMP has thread overhead, and affinity/coherence problems.
Main problem with MPI on shared memory: data duplication.

Eijkhout: MPI course 23


15. MPI shared memory

Shared memory access: two processes can access each other’s


memory through double* (and such) pointers, if they are on the
same shared memory.
Limitation: only window memory.
Non-use case: remote update. This has all the problems of traditional
shared memory (race conditions, consistency).
Good use case: every process needs access to large read-only dataset
Example: ray tracing.

Eijkhout: MPI course 24


16. Shared memory treatments in MPI

MPI uses optimizations for shared memory: copy instead of socket call
One-sided offers ‘fake shared memory’: yes, can access another
process’ data, but only through function calls.
MPI-3 shared memory gives you a pointer to another process’ memory,
if that process is on the same shared memory.

Eijkhout: MPI course 25


17. Shared memory per cluster node

Cluster node has shared memory


Memory is attached to specific socket
beware Non-Uniform Memory Access (NUMA) effects

Eijkhout: MPI course 26


18. Shared memory interface

Here is the high level overview; details next.

Use MPI_Comm_split_type to find processes on the same shared memory


Use MPI_Win_allocate_shared to create a window between processes on
the same shared memory
Use MPI_Win_shared_query to get pointer to another process’ window
data.
You can now use memcpy instead of MPI_Put.

Eijkhout: MPI course 27


19. Discover shared memory

MPI_Comm_split_type splits into communicators of same type.


Use type: MPI_COMM_TYPE_SHARED splitting by shared memory.
(MPI-4: split by other hardware features through
MPI_COMM_TYPE_HW_GUIDED and MPI_Get_hw_resource_types)
Code:
// commsplittype.c
MPI_Info info;
MPI_Comm_split_type
  (MPI_COMM_WORLD,
   MPI_COMM_TYPE_SHARED,
   procno,info,&sharedcomm);
MPI_Comm_size
  (sharedcomm,&new_nprocs);
MPI_Comm_rank
  (sharedcomm,&new_procno);

Output:
make[3]: ‘commsplittype’ is up to date.
TACC: Starting up job 4356245
TACC: Starting parallel tasks...
There are 10 ranks total
[0] is processor 0 in a shared group of 5, runni
[5] is processor 0 in a shared group of 5, runni
TACC: Shutdown complete. Exiting.
Eijkhout: MPI course 28


Exercise 3

Write a program that uses MPI_Comm_split_type to analyze for a run

1 How many nodes there are;
2 How many processes there are on each node.

If you run this program on an unequal distribution, say 10 processes on 3 nodes, what distribution do you find?

Nodes: 3; processes: 10
TACC: Starting up job 4210429
TACC: Starting parallel tasks...
There are 3 nodes
Node sizes: 4 3 3
TACC: Shutdown complete. Exiting.

Eijkhout: MPI course 29


20. Allocate shared window

Use MPI_Win_allocate_shared to create a window that can be shared;

Has to be on a communicator on shared memory


Example: window is one double.
// sharedbulk.c
MPI_Win node_window;
MPI_Aint window_size; double *window_data;
if (onnode_procid==0)
window_size = sizeof(double);
else window_size = 0;
MPI_Win_allocate_shared
( window_size,sizeof(double),MPI_INFO_NULL,
nodecomm,
&window_data,&node_window);

For the full source of this example, see section ??

Eijkhout: MPI course 30


21. Get pointer to other windows

Use MPI_Win_shared_query:
MPI_Aint window_size0; int window_unit; double *win0_addr;
MPI_Win_shared_query
( node_window,0,
&window_size0,&window_unit, &win0_addr );

For the full source of this example, see section ??
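
A minimal sketch (not from the slides) of what that pointer buys you: another process on the node stores directly into process 0's window memory. The value written here is made up, and fences are used for synchronization as in the earlier examples:

MPI_Win_fence(0,node_window);
if (onnode_procid==1) {
  double my_value = 3.14;                         // hypothetical payload
  memcpy( win0_addr, &my_value, sizeof(double) ); // direct store, no MPI_Put needed
}
MPI_Win_fence(0,node_window);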

Eijkhout: MPI course 31


MPI_Win_shared_query

Param name   Explanation                          C type                  F type

MPI_Win_shared_query (
MPI_Win_shared_query_c (
  win        shared memory window object          MPI_Win                 TYPE(MPI_Win)
  rank       rank in the group of window win,     int                     INTEGER
             or MPI_PROC_NULL
  size       size of the window segment           MPI_Aint*               INTEGER(KIND=MPI_ADDRESS_KIND)
  disp_unit  local unit size for displacements,   int* / MPI_Aint* (_c)   INTEGER
             in bytes
  baseptr    address for load/store access to     void*                   TYPE(C_PTR)
             window segment
)

Eijkhout: MPI course 32


22. Allocated memory

Memory will be allocated contiguously


convenient for address arithmetic,
not for NUMA: set alloc_shared_noncontig true in MPI_Info object.

Example: each window stores one double. Measure distance in bytes:

Strategy: default behavior of shared window allocation
  Distance 1 to zero: 8
  Distance 2 to zero: 16

Strategy: allow non-contiguous shared window allocation
  Distance 1 to zero: 4096
  Distance 2 to zero: 8192

Question: what is going on here?
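
To get the non-contiguous allocation, pass an info object when creating the window; a minimal sketch, reusing the variables of the earlier allocation code:

MPI_Info winfo;
MPI_Info_create(&winfo);
MPI_Info_set(winfo,"alloc_shared_noncontig","true"); // request per-process allocation
MPI_Win_allocate_shared
  ( window_size,sizeof(double),winfo,
    nodecomm,
    &window_data,&node_window);
MPI_Info_free(&winfo);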

Eijkhout: MPI course 33


23. Exciting example: bulk data

Application: ray tracing:


large read-only data structure describing the scene
traditional MPI would duplicate:
excessive memory demands
Better: allocate shared data on process 0 of the shared communicator
Everyone else points to this object.

Eijkhout: MPI course 34


Part III

Advanced collectives

Eijkhout: MPI course 35


24. Non-blocking collectives
Collectives are blocking.
Compare blocking/non-blocking sends:
MPI_Send → MPI_Isend
immediate return of control, produce request object.
Non-blocking collectives:
MPI_Bcast → MPI_Ibcast
Same:
MPI_Isomething( <usual arguments>, MPI_Request *req);
Considerations:
Calls return immediately;
the usual story about buffer reuse
Requires MPI_Wait... for completion.
Multiple collectives can complete in any order
Why?
Use for overlap communication/computation
Imbalance resilience
Allows pipelining
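
A minimal sketch of this calling pattern, assuming buffer, count, and comm are already set up:

MPI_Request request;
MPI_Ibcast(buffer,count,MPI_DOUBLE,/* root: */ 0,comm,&request);
// ... work that does not touch the buffer ...
MPI_Wait(&request,MPI_STATUS_IGNORE);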
Eijkhout: MPI course 36
MPI_Ibcast

Param name  Explanation                    C type            F type

MPI_Ibcast (
MPI_Ibcast_c (
  buffer    starting address of buffer     void*             TYPE(*), DIMENSION(..)
  count     number of entries in buffer    int / MPI_Count   INTEGER
  datatype  datatype of buffer             MPI_Datatype      TYPE(MPI_Datatype)
  root      rank of broadcast root         int               INTEGER
  comm      communicator                   MPI_Comm          TYPE(MPI_Comm)
  request   communication request          MPI_Request*      TYPE(MPI_Request)
)

Eijkhout: MPI course 37


25. Overlapping collectives

Independent collective and local operations:

y ← Ax + (x^t x) y

MPI_Iallreduce( .... x ..., &request);
// compute the matrix vector product
MPI_Wait(&request,MPI_STATUS_IGNORE);
// do the addition
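
A fuller sketch of the same idea, assuming the local rows of the product Ax need no communication; the names nlocal, y_tmp, and matvec_row are hypothetical stand-ins for the application's own data and routines:

double local_dot=0., global_dot=0.;
for (int i=0; i<nlocal; i++)          // local contribution to x^t x
  local_dot += x[i]*x[i];
MPI_Request request;
MPI_Iallreduce(&local_dot,&global_dot,1,MPI_DOUBLE,MPI_SUM,comm,&request);
for (int i=0; i<nlocal; i++)          // overlap: local part of A x
  y_tmp[i] = matvec_row(A,x,i);       // hypothetical helper
MPI_Wait(&request,MPI_STATUS_IGNORE);
for (int i=0; i<nlocal; i++)          // finish y = A x + (x^t x) y
  y[i] = y_tmp[i] + global_dot*y[i];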

Eijkhout: MPI course 38


26. Simultaneous reductions

Do two reductions (on the same communicator) with different operators


simultaneously:
α ← x^t y
β ← ∥z∥∞
which translates to:
MPI_Request reqs[2];
MPI_Iallreduce
( &local_xy, &global_xy, 1,MPI_DOUBLE,MPI_SUM,comm,
&(reqs[0]) );
MPI_Iallreduce
( &local_xinf,&global_xinf,1,MPI_DOUBLE,MPI_MAX,comm,
&(reqs[1]) );
MPI_Waitall(2,reqs,MPI_STATUSES_IGNORE);

Eijkhout: MPI course 39


27. Matching collectives

Blocking and non-blocking don’t match: either all processes call the
non-blocking or all call the blocking one. Thus the following code is
incorrect:
if (rank==root)
MPI_Reduce( &x /* ... */ root,comm );
else
MPI_Ireduce( &x /* ... */ root,comm,&req);

This is unlike the point-to-point behavior of non-blocking calls: you can


catch a message with MPI_Irecv that was sent with MPI_Send.

Eijkhout: MPI course 40


28. Transpose as gather/scatter

Every process needs to do a scatter or gather.

Eijkhout: MPI course 41


29. Simultaneous collectives

Transpose matrix by scattering all rows simultaneously.


Each scatter involves all processes, but with a different spanning tree.
MPI_Request scatter_requests[nprocs];
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Iscatter( regular,1,MPI_DOUBLE,
&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm,scatter_requests+iproc);
}
MPI_Waitall(nprocs,scatter_requests,MPI_STATUSES_IGNORE);

Eijkhout: MPI course 42


Persistent collectives

Eijkhout: MPI course 43


30. Persistent collectives (MPI-4)

Similar to persistent send/recv:


MPI_Allreduce_init( ...., &request );
for ( ... ) {
  MPI_Start( &request );
  MPI_Wait( &request, MPI_STATUS_IGNORE );
}
MPI_Request_free( &request );

Available for all collectives and neighborhood collectives.

Eijkhout: MPI course 44


31. Example

// powerpersist.c
double localnorm,globalnorm=1.;
MPI_Request reduce_request;
MPI_Allreduce_init
( &localnorm,&globalnorm,1,MPI_DOUBLE,MPI_SUM,
comm,MPI_INFO_NULL,&reduce_request);
for (int it=0; it<10; it++) {
matmult(indata,outdata,buffersize);
localnorm = localsum(outdata,buffersize);
MPI_Start( &reduce_request );
MPI_Wait( &reduce_request,MPI_STATUS_IGNORE );
scale(outdata,indata,buffersize,1./sqrt(globalnorm));
}
MPI_Request_free( &reduce_request );

Note also the MPI_Info parameter.

Eijkhout: MPI course 45


32. Persistent vs non-blocking

Both request-based.

Non-blocking is ‘ad hoc’: buffer info not known before the collective
call.
Persistent allows ‘planning ahead’: management of internal buffers
and such.

Eijkhout: MPI course 46


Non-blocking barrier

Eijkhout: MPI course 47


33. Just what is a barrier?

Barrier is not time synchronization but state synchronization.


Test on non-blocking barrier: ‘has everyone reached some state’

Eijkhout: MPI course 48


34. Use case: adaptive refinement

Some processes decide locally to alter their structure


. . . need to communicate that to neighbors
Problem: neighbors don’t know whether to expect update calls, if at
all.
Solution:
send update msgs, if any;
then post barrier.
Everyone probe for updates, test for barrier.

Eijkhout: MPI course 49


35. Use case: distributed termination detection

Distributed termination detection (Matocha and Kamp, 1998):


draw a global conclusion with local operations
Everyone posts the barrier when done;
keeps doing local computation while testing for the barrier to
complete

Eijkhout: MPI course 50


MPI_Ibarrier

Name Param name Explanation C type F type


MPI_Ibarrier (
comm communicator MPI_Comm TYPE(MPI_Comm)
request communication request MPI_Request* TYPE(MPI_Request)
)

Eijkhout: MPI course 51


36. Step 1

Do sends, post barrier.


// ibarrierprobe.c
if (i_do_send) {
/*
* Pick a random process to send to,
* not yourself.
*/
int receiver = rand()%nprocs;
MPI_Ssend(&data,1,MPI_FLOAT,receiver,0,comm);
}
/*
* Everyone posts the non-blocking barrier
* and gets a request to test/wait for
*/
MPI_Request barrier_request;
MPI_Ibarrier(comm,&barrier_request);

Eijkhout: MPI course 52


37. Step 2

Poll for barrier and messages


for ( ; ; step++) {
int barrier_done_flag=0;
MPI_Test(&barrier_request,&barrier_done_flag,
MPI_STATUS_IGNORE);
//stop if you’re done!
if (barrier_done_flag) {
break;
} else {
// if you’re not done with the barrier:
int flag; MPI_Status status;
MPI_Iprobe
( MPI_ANY_SOURCE,MPI_ANY_TAG,
comm, &flag, &status );
if (flag) {
// absorb message!

Eijkhout: MPI course 53


Part IV

Process topologies

Eijkhout: MPI course 54


38. Overview

This section discusses topologies:

Cartesian topology
MPI-1 Graph topology
MPI-3 Graph topology

Commands learned:

MPI_Dist_graph_create, MPI_DIST_GRAPH, MPI_Dist_graph_neighbors_count


MPI_Neighbor_allgather and such

Eijkhout: MPI course 55


39. Process topologies

Processes don’t communicate at random


Example: Cartesian grid, each process 4 (or so) neighbors
Express operations in terms of topology
Elegance of expression
MPI can optimize

Eijkhout: MPI course 56


40. Process reordering

Consecutive process numbering often the best:


divide array by chunks
Not optimal for grids or general graphs:
MPI is allowed to renumber ranks
Graph topology gives information from which to deduce renumbering

Eijkhout: MPI course 57


41. MPI-1 topology

Cartesian topology
Graph topology, globally specified.
Not scalable, do not use!

Eijkhout: MPI course 58


42. MPI-3 topology

Graph topologies locally specified: scalable!


Neighborhood collectives:
expression close to the algorithm.

Eijkhout: MPI course 59


Graph topologies

Eijkhout: MPI course 60


43. Example: 5-point stencil

Neighbor exchange, spelled out:

Each process communicates down/right/up/left


Send and receive at the same time.
Can optimally be done in four steps

Eijkhout: MPI course 61


44. Step 1

Eijkhout: MPI course 62


45. Step 2

The middle node is blocked because all its targets are already receiving
or a channel is occupied:
one missed turn

Eijkhout: MPI course 63


46. Neighborhood collective
This is really a ‘local gather’:
each node does a gather from its neighbors in whatever order.
MPI_Neighbor_allgather

Eijkhout: MPI course 64


47. Why neighborhood collectives?

Using MPI_Isend / MPI_Irecv is like spelling out a collective;


Collectives can use pipelining as opposed to sending a whole buffer;
Collectives can use spanning trees as opposed to direct connections.

Eijkhout: MPI course 65


48. Create graph topology
int MPI_Dist_graph_create
(MPI_Comm comm_old, int nsources, const int sources[],
const int degrees[], const int destinations[],
const int weights[], MPI_Info info, int reorder,
MPI_Comm *comm_dist_graph)

nsources how many source nodes described? (Usually 1)


sources the processes being described (Usually MPI_Comm_rank value)
degrees how many processes to send to
destinations their ranks
weights: usually set to MPI_UNWEIGHTED.
info: MPI_INFO_NULL will do
reorder: 1 to allow dynamic reordering of processes

Eijkhout: MPI course 66


49. Neighborhood collectives

int MPI_Neighbor_allgather
(const void *sendbuf, int sendcount,MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype,
MPI_Comm comm)

Like an ordinary MPI_Allgather, but


the receive buffer has a length equal to the degree
(instead of the communicator size).

Eijkhout: MPI course 67


50. Neighbor querying

After MPI_Neighbor_allgather, the data in the buffer is not in normal rank order.

MPI_Dist_graph_neighbors_count gives actual number of neighbors.


(Why do you need this?)
MPI_Dist_graph_neighbors lists neighbor numbers.
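
A minimal sketch of the two query calls for a distributed-graph communicator graph_comm (a hypothetical name), created without edge weights:

int indegree,outdegree,weighted;
MPI_Dist_graph_neighbors_count(graph_comm,&indegree,&outdegree,&weighted);
int *sources      = (int*)malloc(indegree*sizeof(int));
int *destinations = (int*)malloc(outdegree*sizeof(int));
MPI_Dist_graph_neighbors
  (graph_comm,
   indegree, sources,      MPI_UNWEIGHTED,
   outdegree,destinations, MPI_UNWEIGHTED);
// 'sources' lists the ranks you receive from, in the order in which
// MPI_Neighbor_allgather places their contributions in the receive buffer.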

Eijkhout: MPI course 68


MPI_Dist_graph_neighbors
Param name       Explanation                                   C type     F type

MPI_Dist_graph_neighbors (
  comm           communicator with distributed graph topology  MPI_Comm   TYPE(MPI_Comm)
  maxindegree    size of sources and sourceweights arrays      int        INTEGER
  sources        processes for which the calling process       int[]      INTEGER(maxindegree)
                 is a destination
  sourceweights  weights of the edges into the calling         int[]      INTEGER(*)
                 process
  maxoutdegree   size of destinations and destweights arrays   int        INTEGER
  destinations   processes for which the calling process       int[]      INTEGER(maxoutdegree)
                 is a source
  destweights    weights of the edges out of the calling       int[]      INTEGER(*)
                 process
)

Eijkhout: MPI course 69


51. Example: Systolic graph
Code:
// graph.c
for ( int i=0; i<=1; i++ ) {
int neighb_i = proci+i;
if (neighb_i<0 || neighb_i>=idim)
continue;
for (int j=0; j<=1; j++ ) {
int neighb_j = procj+j;
if (neighb_j<0 || neighb_j>=jdim)
continue;
destinations[ degree++ ] =
PROC(neighb_i,neighb_j,idim,jdim);
}
}
MPI_Dist_graph_create
(comm,
/* I specify just one proc: me */ 1,
&procno,&degree,destinations,weights,
MPI_INFO_NULL,0,
&comm2d
);
Eijkhout: MPI course 70
52. output

Code:
int indegree,outdegree,weighted;
MPI_Dist_graph_neighbors_count
  (comm2d,
   &indegree,&outdegree,
   &weighted);
int
  my_ij[2] = {proci,procj},
  other_ij[4][2];
MPI_Neighbor_allgather
  ( my_ij,2,MPI_INT,
    other_ij,2,MPI_INT,
    comm2d );

Output:
[ 0 = (0,0)] has 4 outbound: 0, 1, 2, 3,   1 inbound: (0,0)=0
[ 1 = (0,1)] has 2 outbound: 1, 3,         2 inbound: (0,1)=1 (0,0)=0
[ 2 = (1,0)] has 4 outbound: 2, 3, 4, 5,   2 inbound: (1,0)=2 (0,0)=0
[ 3 = (1,1)] has 2 outbound: 3, 5,         4 inbound: (1,1)=3 (1,0)=2 (0,1)=1 (0,0)=0
[ 4 = (2,0)] has 2 outbound: 4, 5,         2 inbound: (2,0)=4 (1,0)=2
[ 5 = (2,1)] has 1 outbound: 5,            4 inbound: (2,1)=5 (1,1)=3 (2,0)=4 (1,0)=2

Eijkhout: MPI course 71


Exercise 4 (rightgraph)

Earlier rightsend exercise

Revisit exercise 5 and solve it using MPI_Dist_graph_create. Use figure 53


for inspiration.

Use a degree value of 1.

Eijkhout: MPI course 72


53. Inspiring picture for the previous exercise

Solving the right-send exercise with neighborhood collectives

Eijkhout: MPI course 73


54. Hints for the previous exercise

Two approaches:

1 Declare just one source: the previous process. Do this! Or:


2 Declare two sources: the previous and yourself. In that case bear in
mind slide 50.

Eijkhout: MPI course 74


55. More graph collectives

Heterogeneous: MPI_Neighbor_alltoallw.
Non-blocking: MPI_Ineighbor_allgather and such
Persistent: MPI_Neighbor_allgather_init,
MPI_Neighbor_allgatherv_init.

Eijkhout: MPI course 75


MPI-4

Eijkhout: MPI course 76


Justification

Version 3 of the MPI standard has added a number of features, some


geared purely towards functionality, others with an eye towards efficiency
at exascale.

Version 4 adds yet more features for exascale, and more flexible process
management.

Note: MPI-3 as of 2012, 3.1 as of 2015. Fully supported everywhere.


MPI-4 as of June 2021. Partial support in mpich version 4.1.

Eijkhout: MPI course 77


Part V

Fortran bindings

Eijkhout: MPI course 78


56. Overview

The Fortran interface to MPI had some defects. With Fortran2008 these
have been largely repaired.

The trailing error parameter is now optional;


MPI data types are now actual Type objects, rather than Integer
Strict type checking on arguments.

Eijkhout: MPI course 79


57. MPI headers

New module:

use mpi_f08 ! for Fortran2008


use mpi ! for Fortran90

True Fortran bindings as of the 2008 standard. Provided in

Intel compiler version 18 or newer,


gcc 9 and later (not with Intel MPI, use mvapich).

Eijkhout: MPI course 80


58. Optional error parameter

Old Fortran90 style: New Fortran2008 style:


call MPI_Init(ierr) call MPI_Init()
! your code ! your code
call MPI_Finalize(ierr) call MPI_Finalize()

Eijkhout: MPI course 81


59. Communicators

!! Fortran 2008 interface


use mpi_f08
Type(MPI_Comm) :: comm = MPI_COMM_WORLD

!! Fortran legacy interface


#include <mpif.h>
! or: use mpi
Integer :: comm = MPI_COMM_WORLD

Eijkhout: MPI course 82


60. Requests

Requests are also derived types


note that ...NULL entities are now objects, not integers
!! waitnull.F90
Type(MPI_Request),dimension(:),allocatable :: requests
allocate(requests(ntids-1))
call MPI_Waitany(ntids-1,requests,index,MPI_STATUS_IGNORE)
if ( .not. requests(index)==MPI_REQUEST_NULL) then
print *,"This request should be null:",index

(Q for the alert student: do you see anything halfway remarkable about
that index?)

Eijkhout: MPI course 83


61. More

Type(MPI_Datatype) :: newtype ! F2008


Integer :: newtype ! F90

Also: MPI_Comm, MPI_Datatype, MPI_Errhandler, MPI_Group, MPI_Info, MPI_File,


MPI_Op, MPI_Request, MPI_Status, MPI_Win

Eijkhout: MPI course 84


62. Status
Fortran2008: status is a Type with fields:
!! anysource.F90
Type(MPI_Status) :: status
allocate(recv_buffer(ntids-1))
do p=0,ntids-2
call MPI_Recv(recv_buffer(p+1),1,MPI_INTEGER,&
MPI_ANY_SOURCE,0,comm,status)
sender = status%MPI_SOURCE

Fortran90: status is an array with named indexing


!! anysource.F90
integer :: status(MPI_STATUS_SIZE)
allocate(recv_buffer(ntids-1))
do p=0,ntids-2
call MPI_Recv(recv_buffer(p+1),1,MPI_INTEGER,&
MPI_ANY_SOURCE,0,comm,status,err)
sender = status(MPI_SOURCE)

Eijkhout: MPI course 85


63. Type checking

Type checking catches potential problems:


!! typecheckarg.F90
integer,parameter :: n=2
Integer,dimension(n) :: source
call MPI_Init()
call MPI_Send(source,MPI_INTEGER,n, &
1,0,MPI_COMM_WORLD)

typecheck.F90(20): error #6285:


There is no matching specific subroutine
for this generic subroutine call. [MPI_SEND]
call MPI_Send(source,MPI_INTEGER,n,
-------^

Eijkhout: MPI course 86


64. Type checking’

Type checking does not catch all problems:


!! typecheckbuf.F90
integer,parameter :: n=1
Real,dimension(n) :: source
call MPI_Init()
call MPI_Send(source,n,MPI_INTEGER, &
1,0,MPI_COMM_WORLD)

Buffer/type mismatch is not caught.

Eijkhout: MPI course 87


Part VI

Big data communication

Eijkhout: MPI course 88


65. Overview

This section discusses big messages.

Commands learned:

MPI_Send_c, MPI_Allreduce_c, MPI_Get_count_c (MPI-4)


MPI_Get_elements_x, MPI_Type_get_extent_x,
MPI_Type_get_true_extent_x (MPI-3)

Eijkhout: MPI course 89


66. The problem with large messages

There is no problem allocating large buffers:


size_t bigsize = (size_t)1<<33;
double *buffer =
(double*) malloc(bigsize*sizeof(double));

But you can not tell MPI how big the buffer is:
MPI_Send(buffer,bigsize,MPI_DOUBLE,...) // WRONG

because the size argument has to be int.

Eijkhout: MPI course 90


67. MPI 3 count type
Count type since MPI 3
C:
MPI_Count count;

Fortran:
Integer(kind=MPI_COUNT_KIND) :: count

Big enough for

int;
MPI_Aint, used in one-sided;
MPI_Offset, used in file I/O.

However, this type could not be used in MPI-3 to describe send buffers.

Eijkhout: MPI course 91


68. MPI 4 large count routines

C: routines with _c suffix


MPI_Count count;
MPI_Send_c( buff,count,MPI_INT, ... );

also MPI_Reduce_c, MPI_Get_c, . . . (some 190 routines in all)

Fortran: polymorphism rules


Integer(kind=MPI_COUNT_KIND) :: count
call MPI_Send( buff,count, MPI_INTEGER, ... )

Eijkhout: MPI course 92


69. Big count example

// pingpongbig.c
assert( sizeof(MPI_Count)>4 );
for ( int power=3; power<=10; power++) {
MPI_Count length=pow(10,power);
buffer = (double*)malloc( length*sizeof(double) );
MPI_Ssend_c
(buffer,length,MPI_DOUBLE,
processB,0,comm);
MPI_Recv_c
(buffer,length,MPI_DOUBLE,
processB,0,comm,MPI_STATUS_IGNORE);

Eijkhout: MPI course 93


MPI_Send

Param name  Explanation                            C type            F type

MPI_Send (
MPI_Send_c (
  buf       initial address of send buffer         const void*       TYPE(*), DIMENSION(..)
  count     number of elements in send buffer      int / MPI_Count   INTEGER
  datatype  datatype of each send buffer element   MPI_Datatype      TYPE(MPI_Datatype)
  dest      rank of destination                    int               INTEGER
  tag       message tag                            int               INTEGER
  comm      communicator                           MPI_Comm          TYPE(MPI_Comm)
)

Eijkhout: MPI course 94


70. MPI 4 large count querying

C:
MPI_Count count;
MPI_Get_count_c( &status,MPI_INT, &count );
MPI_Get_elements_c( &status,MPI_INT, &count );

Fortran:
Integer(kind=MPI_COUNT_KIND) :: count
call MPI_Get_count( status,MPI_INTEGER,count )
call MPI_Get_elements( status,MPI_INTEGER,count )

Eijkhout: MPI course 95


71. MPI 3 kludge: use semi-large types

Make a derived datatype, and send a couple of those:


MPI_Datatype blocktype;
MPI_Type_contiguous(mediumsize,MPI_FLOAT,&blocktype);
MPI_Type_commit(&blocktype);
if (procno==sender) {
MPI_Send(source,nblocks,blocktype,receiver,0,comm);

You can even receive them:


} else if (procno==receiver) {
MPI_Status recv_status;
MPI_Recv(target,nblocks,blocktype,sender,0,comm,
&recv_status);

Eijkhout: MPI course 96


72. Large int counting

MPI-3 mechanism, deprecated (probably) in MPI-4.1:

By composing types you can make a ‘big type’. Use


MPI_Type_get_extent_x, MPI_Type_get_true_extent_x, MPI_Get_elements_x
to query.
MPI_Count recv_count;
MPI_Get_elements_x(&recv_status,MPI_FLOAT,&recv_count);

Eijkhout: MPI course 97


Part VII

Partitioned communication

Eijkhout: MPI course 98


73. Partitioned communication (MPI-4)

Hybrid scenario:
multiple threads contribute to one large message

Partitioned send/recv:
the contributions can be declared/tested

Eijkhout: MPI course 99


74. Create partitions

// partition.c
int bufsize = nparts*SIZE;
int *partitions = (int*)malloc((nparts+1)*sizeof(int));
for (int ip=0; ip<=nparts; ip++)
partitions[ip] = ip*SIZE;
if (procno==src) {
double *sendbuffer = (double*)malloc(bufsize*sizeof(double));

Eijkhout: MPI course 100


75. Init calls

Similar to init calls for persistent sends,


but specify the number of partitions.
MPI_Psend_init
(sendbuffer,nparts,SIZE,MPI_DOUBLE,tgt,0,
comm,MPI_INFO_NULL,&send_request);
MPI_Precv_init
(recvbuffer,nparts,SIZE,MPI_DOUBLE,src,0,
comm,MPI_INFO_NULL,&recv_request);

Eijkhout: MPI course 101


76. Partitioned send

MPI_Request send_request;
MPI_Psend_init
  (sendbuffer,nparts,SIZE,MPI_DOUBLE,tgt,0,
   comm,MPI_INFO_NULL,&send_request);
for (int it=0; it<ITERATIONS; it++) {
  MPI_Start(&send_request);
  for (int ip=0; ip<nparts; ip++) {
    fill_buffer(sendbuffer,partitions[ip],partitions[ip+1],ip);
    MPI_Pready(ip,send_request);
  }
  MPI_Wait(&send_request,MPI_STATUS_IGNORE);
}
MPI_Request_free(&send_request);

Eijkhout: MPI course 102


77. Partitioned receive

double *recvbuffer = (double*)malloc(bufsize*sizeof(double));


MPI_Request recv_request;
MPI_Precv_init
(recvbuffer,nparts,SIZE,MPI_DOUBLE,src,0,
comm,MPI_INFO_NULL,&recv_request);
for (int it=0; it<ITERATIONS; it++) {
MPI_Start(&recv_request);
MPI_Wait(&recv_request,MPI_STATUS_IGNORE);
  int r = 1;
  for (int ip=0; ip<nparts; ip++)
    r *= chck_buffer(recvbuffer,partitions[ip],partitions[ip+1],ip);
}
MPI_Request_free(&recv_request);

Eijkhout: MPI course 103


78. Partitioned receive tests

Use
MPI_Parrived(recv_request,ipart,&flag);

to test for arrived partitions.
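
A minimal polling sketch (not from the slides), reusing the variables of the receive example; the handled array is hypothetical bookkeeping:

int *handled = (int*)calloc(nparts,sizeof(int));
MPI_Start(&recv_request);
for (int ndone=0; ndone<nparts; ) {
  for (int ip=0; ip<nparts; ip++) {
    if (handled[ip]) continue;
    int flag=0;
    MPI_Parrived(recv_request,ip,&flag);
    if (flag) {   // this partition of recvbuffer can be processed now
      handled[ip] = 1; ndone++;
    }
  }
}
MPI_Wait(&recv_request,MPI_STATUS_IGNORE);
free(handled);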

Eijkhout: MPI course 104


Part VIII

Sessions model

Eijkhout: MPI course 105


79. Problems with the ‘world model’

MPI is started exactly once:

MPI can not close down and restart.


Libraries using MPI need to agree on threading and such.

Eijkhout: MPI course 106


80. Sketch of a solution

Eijkhout: MPI course 107


81. World and session model

World model: what you have been doing so far;


Start with MPI_COMM_WORLD and make subcommunicators,
or spawn new world communicators and bridge them
Session model: have multiple sessions active,
each starting/ending MPI separately.

Eijkhout: MPI course 108


82. Session model

Create a session;
a session has multiple ‘process sets’
from a process set you make a communicator;
Potentially create multiple sessions in one program run
Can not mix objects from multiple simultaneous sessions

Eijkhout: MPI course 109


83. Session creating

// session.c
MPI_Info session_request_info = MPI_INFO_NULL;
MPI_Info_create(&session_request_info);
char thread_key[] = "mpi_thread_support_level";
MPI_Info_set(session_request_info,
thread_key,"MPI_THREAD_MULTIPLE");

The info object can also be MPI_INFO_NULL; then:
MPI_Session the_session;
MPI_Session_init
( session_request_info,MPI_ERRORS_ARE_FATAL,
&the_session );
MPI_Session_finalize( &the_session );

Eijkhout: MPI course 110


84. Session: process sets

Process sets, identified by name (not a data type):


int npsets;
MPI_Session_get_num_psets
( the_session,MPI_INFO_NULL,&npsets );
if (mainproc) printf("Number of process sets: %d\n",npsets);
for (int ipset=0; ipset<npsets; ipset++) {
int len_pset; char name_pset[MPI_MAX_PSET_NAME_LEN];
MPI_Session_get_nth_pset( the_session,MPI_INFO_NULL,
ipset,&len_pset,name_pset );
if (mainproc)
printf("Process set %2d: <<%s>>\n",ipset,name_pset);

the sets mpi://SELF and mpi://WORLD are always defined.

Eijkhout: MPI course 111


85. Session: create communicator

Process set → group → communicator


MPI_Group world_group = MPI_GROUP_NULL;
MPI_Comm world_comm = MPI_COMM_NULL;
MPI_Group_from_session_pset
( the_session,world_name,&world_group );
MPI_Comm_create_from_group
( world_group,"victor-code-session.c",
MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,
&world_comm );
MPI_Group_free( &world_group );
int procid = -1, nprocs = 0;
MPI_Comm_size(world_comm,&nprocs);
MPI_Comm_rank(world_comm,&procid);

Eijkhout: MPI course 112


86. Multiple sessions

// sessionmulti.c
MPI_Info info1 = MPI_INFO_NULL, info2 = MPI_INFO_NULL;
char thread_key[] = "mpi_thread_support_level";
MPI_Info_create(&info1); MPI_Info_create(&info2);
MPI_Info_set(info1,thread_key,"MPI_THREAD_SINGLE");
MPI_Info_set(info2,thread_key,"MPI_THREAD_MULTIPLE");
MPI_Session session1,session2;
MPI_Session_init( info1,MPI_ERRORS_ARE_FATAL,&session1 );
MPI_Session_init( info2,MPI_ERRORS_ARE_FATAL,&session2 );

Eijkhout: MPI course 113


87. Practical use: libraries
// sessionlib.cxx
class Library {
private:
MPI_Comm world_comm; MPI_Session session;
public:
Library() {
MPI_Info info = MPI_INFO_NULL;
MPI_Session_init
( MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,&session );
char world_name[] = "mpi://WORLD";
MPI_Group world_group;
MPI_Group_from_session_pset
( session,world_name,&world_group );
MPI_Comm_create_from_group
( world_group,"world-session",
MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,
&world_comm );
MPI_Group_free( &world_group );
};
~Library() { MPI_Session_finalize(&session); };

Eijkhout: MPI course 114


88. Practical use: main

int main(int argc,char **argv) {

Library lib1,lib2;
MPI_Init(0,0);
MPI_Comm world = MPI_COMM_WORLD;
int procno,nprocs;
MPI_Comm_rank(world,&procno);
MPI_Comm_size(world,&nprocs);
auto sum1 = lib1.compute(procno);
auto sum2 = lib2.compute(procno+1);

Eijkhout: MPI course 115


Part IX

Other MPI-4 material

Eijkhout: MPI course 116


89. Better aborts

Error handler MPI_ERRORS_ABORT: aborts on the processes in the


communicator for which it is specified.
Error code MPI_ERR_PROC_ABORTED: process tried to communicate with a
process that has aborted.
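
A minimal sketch of attaching the new handler to a communicator (sub_comm is a hypothetical subcommunicator):

MPI_Comm_set_errhandler(sub_comm,MPI_ERRORS_ABORT);
// an error raised on sub_comm now aborts only the processes of sub_comm,
// not all of MPI_COMM_WORLD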

Eijkhout: MPI course 117


90. Error as C-string

MPI_Info_get and MPI_Info_get_valuelen are not robust with respect to the


null terminator.
Replace by:
int MPI_Info_get_string
(MPI_Info info, const char *key,
int *buflen, char *value, int *flag)
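
A minimal usage sketch; the key queried here is only an example:

char value[MPI_MAX_INFO_VAL+1];
int buflen = sizeof(value), flag;
MPI_Info_get_string(info,"mpi_thread_support_level",&buflen,value,&flag);
if (flag)
  printf("thread support: %s\n",value);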

Eijkhout: MPI course 118


91. Comm split by hw type

MPI_Comm_split_type has exactly one type in MPI-3: MPI_COMM_TYPE_SHARED

MPI-4: types

MPI_COMM_TYPE_HW_GUIDED use info to specify hardware type


MPI_COMM_TYPE_HW_UNGUIDED, same but strict subset

Query types with MPI_Get_hw_resource_types.
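
A minimal sketch of the guided variant; the info key mpi_hw_resource_type is specified by MPI-4, but the value used here ("NUMANode") and the name numa_comm are assumptions, since accepted values are implementation-dependent:

MPI_Info hwinfo;
MPI_Comm numa_comm;
MPI_Info_create(&hwinfo);
MPI_Info_set(hwinfo,"mpi_hw_resource_type","NUMANode"); // value name is an assumption
MPI_Comm_split_type
  (MPI_COMM_WORLD,MPI_COMM_TYPE_HW_GUIDED,
   /* key: */ 0,hwinfo,&numa_comm);
MPI_Info_free(&hwinfo);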

Eijkhout: MPI course 119


Part X

MPL: a C++ interface to MPI

Eijkhout: MPI course 120


Justification

While the C API to MPI is usable from C++, it feels very unidiomatic for
that language. Message Passing Layer (MPL) is a modern C++11
interface to MPI. It is both idiomatic and elegant, simplifying many calling
sequences. It is very low overhead.

Eijkhout: MPI course 121


Part XI

Basics

Eijkhout: MPI course 122


92. Rank and size

The rank of a process (by mpl::communicator::rank) and the size of a


communicator (by mpl::communicator::size) are both methods of the
communicator class:
const mpl::communicator &comm_world =
mpl::environment::comm_world();
int procid = comm_world.rank();
int nprocs = comm_world.size();

Eijkhout: MPI course 123


93. Scalar buffers

Buffer type handling is done through polymorphism and templating: no
explicit indication of types.

Scalars are handled as such:


float x,y;
comm.bcast( 0,x ); // note: root first
comm.allreduce( mpl::plus<float>(), x,y ); // op first

where the reduction function needs to be compatible with the type of the
buffer.

Eijkhout: MPI course 124


94. Vector buffers

If your buffer is a std::vector you need to take the .data() component of


it:
vector<float> xx(2),yy(2);
comm.allreduce( mpl::plus<float>(),
xx.data(), yy.data(), mpl::contiguous_layout<float>(2) );

The contiguous_layout is a ‘derived type’; this will be discussed in more


detail elsewhere (see note ?? and later). For now, interpret it as a way of
indicating the count/type part of a buffer specification.

Eijkhout: MPI course 125


Collectives

Eijkhout: MPI course 126


95. Reduce on non-root processes

There is a separate variant for non-root usage of rooted collectives:


// scangather.cxx
if (procno==0) {
comm_world.reduce
( mpl::plus<int>(),0,
my_number_of_elements,total_number_of_elements );
} else {
comm_world.reduce
( mpl::plus<int>(),0,my_number_of_elements );
}

Eijkhout: MPI course 127


96. User defined operators

A user-defined operator can be a templated class with an operator().


Example:
// reduceuser.cxx
template<typename T>
class lcm {
public:
T operator()(T a, T b) {
T zero=T();
T t((a/gcd(a, b))*b);
if (t<zero)
return -t;
return t;
}

comm_world.reduce(lcm<int>(), 0, v, result);

Eijkhout: MPI course 128


97. Lambda reduction operators

You can also do the reduction by lambda:


comm_world.reduce
( [] (int i,int j) -> int
{ return i+j; },
0,data );

Eijkhout: MPI course 129


98. Nonblocking collectives

Nonblocking collectives have the same argument list as the corresponding


blocking variant, except that instead of a void result, they return an
irequest. (See 101)
// ireducescalar.cxx
float x{1.},sum;
auto reduce_request =
comm_world.ireduce(mpl::plus<float>(), 0, x, sum);
reduce_request.wait();
if (comm_world.rank()==0) {
std::cout << "sum = " << sum << '\n';
}

Eijkhout: MPI course 130


Point-to-point communication

Eijkhout: MPI course 131


99. Blocking send and receive

MPL uses a default value for the tag, and it can deduce the type of the
buffer. Sending a scalar becomes:
// sendscalar.cxx
if (comm_world.rank()==0) {
double pi=3.14;
comm_world.send(pi, 1); // send to rank 1
cout << "sent: " << pi << ’\n’;
} else if (comm_world.rank()==1) {
double pi=0;
comm_world.recv(pi, 0); // receive from rank 0
cout << "got : " << pi << ’\n’;
}

For the full source of this example, see section ??

Eijkhout: MPI course 132


100. Sending arrays

MPL can send static arrays without further layout specification:


// sendarray.cxx
double v[2][2][2];
comm_world.send(v, 1); // send to rank 1
comm_world.recv(v, 0); // receive from rank 0

For the full source of this example, see section ??

Sending vectors uses a general mechanism:


// sendbuffer.cxx
std::vector<double> v(8);
mpl::contiguous_layout<double> v_layout(v.size());
comm_world.send(v.data(), v_layout, 1); // send to rank 1
comm_world.recv(v.data(), v_layout, 0); // receive from rank 0

For the full source of this example, see section ??

Eijkhout: MPI course 133


101. Requests from nonblocking calls
Nonblocking routines have an irequest as function result. Note: not a
parameter passed by reference, as in the C interface. The various wait calls
are methods of the irequest class.
double recv_data;
mpl::irequest recv_request =
comm_world.irecv( recv_data,sender );
recv_request.wait();
For the full source of this example, see section ??
You can not default-construct the request variable:
// DOES NOT COMPILE:
mpl::irequest recv_request;
recv_request = comm.irecv( ... );

This means that the normal sequence of first declaring, and then filling in,
the request variable is not possible.
MPL implementation note: The wait call always returns a
status object; not assigning it means that the destructor is called
Eijkhout: MPI course 134
102. Request pools

Instead of an array of requests, use an irequest_pool object, which acts


like a vector of requests, meaning that you can push onto it.
// irecvsource.cxx
mpl::irequest_pool recv_requests;
for (int p=0; p<nprocs-1; p++) {
recv_requests.push( comm_world.irecv( recv_buffer[p], p ) );
}

For the full source of this example, see section ??

You can not declare a pool of a fixed size and assign elements.

Eijkhout: MPI course 135


103. Request handling

auto [success,index] = recv_requests.waitany();


if (success) {
auto recv_status = recv_requests.get_status(index);

Eijkhout: MPI course 136


Derived Datatypes

Eijkhout: MPI course 137


104. Vector type

MPL has the strided_vector_layout class as equivalent of the vector type:

// vector.cxx
vector<double>
source(stride*count);
if (procno==sender) {
mpl::strided_vector_layout<double>
newvectortype(count,1,stride);
comm_world.send
(source.data(),newvectortype,the_other);
}

For the full source of this example, see section ??

(See note ?? for nonstrided vectors.)

Eijkhout: MPI course 138


Communicator manipulations

Eijkhout: MPI course 139


105. Communicator splitting
In MPL, splitting a communicator is done as one of the overloads of the
communicator constructor;
// commsplit.cxx
// create sub communicator modulo 2
int color2 = procno % 2;
mpl::communicator comm2( mpl::communicator::split, comm_world, color2 );
auto procno2 = comm2.rank();

// create sub communicator modulo 4 recursively
int color4 = procno2 % 2;
mpl::communicator comm4( mpl::communicator::split, comm2, color4 );
auto procno4 = comm4.rank();

For the full source of this example, see section ??

MPL implementation note: The communicator::split identifier is an object of class
communicator::split_tag, itself an otherwise empty subclass of communicator:
class split_tag {};
static constexpr split_tag split{};

Eijkhout: MPI course 140
Part XII

Summary

Eijkhout: MPI course 141


106. Summary

Atomic one-sided communication and shared memory (MPI-3)


Non-blocking collectives (MPI-3) and persistent collectives (MPI-4)
Graph topologies (MPI-3)
Fortran 2008 bindings (MPI-3)
MPI_Count arguments for large buffers (MPI-4)
Partitioned sends (MPI-4)
Sessions model (MPI-4)
C++ MPL

Eijkhout: MPI course 142


Supplemental material

Eijkhout: MPI course 143


Part XIII

Appendix: exercises

Eijkhout: MPI course 144


Exercise 5 (serialsend)

(Classroom exercise) Each student holds a piece of paper in the right hand
– keep your left hand behind your back – and we want to execute:

1 Give the paper to your right neighbor;


2 Accept the paper from your left neighbor.

Including boundary conditions for first and last process, that becomes the
following program:

1 If you are not the rightmost student, turn to the right and give the
paper to your right neighbor.
2 If you are not the leftmost student, turn to your left and accept the
paper from your left neighbor.

Eijkhout: MPI course 145


Exercise 6 (procgrid)
Organize your processes in a grid, and make subcommunicators for the
rows and columns. For this compute the row and column number of each
process.
In the row and column communicator, compute the rank. For instance, on
a 2 × 3 processor grid you should find:

Global ranks: Ranks in row: Ranks in column:


0 1 2 0 1 2 0 0 0
3 4 5 0 1 2 1 1 1

Check that the rank in the row communicator is the column number, and
the other way around.
Run your code on different number of processes, for instance a number of
rows and columns that is a power of 2, or that is a prime number. This is
one occasion where you could use ibrun -np 9; normally you would
never put a processor count on ibrun.
Eijkhout: MPI course 146
