MPI Sweden Course
Atomic operations
Example:
Init: I=0
process 1: I=I+2
process 2: I=I+3
Depending on how the two read/modify/write sequences interleave, the final value can be 2, 3, or 5: the updates need to be atomic.
// countdownput.c
// read the counter from the counter process
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
         counter_process,0,1,MPI_INT,
         the_window);
MPI_Win_fence(0,the_window);
// decrement locally and write back: not atomic,
// simultaneous updates can get lost
if (i_am_available) {
  int decrement = -1;
  counter_value += decrement;
  MPI_Put
    ( &counter_value, 1,MPI_INT,
      counter_process,0,1,MPI_INT,
      the_window);
}
MPI_Win_fence(0,the_window);
// countdownacc.c
// read the counter from the counter process
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
         counter_process,0,1,MPI_INT,
         the_window);
MPI_Win_fence(0,the_window);
// apply the decrement atomically on the counter process
if (i_am_available) {
  int decrement = -1;
  MPI_Accumulate
    ( &decrement, 1,MPI_INT,
      counter_process,0,1,MPI_INT,
      MPI_SUM,
      the_window);
}
MPI_Win_fence(0,the_window);
Atomic ‘get-and-set-with-no-one-coming-in-between’:
MPI_Fetch_and_op / MPI_Get_accumulate.
The former is the simple version: scalar data only.
No user-defined operators are allowed.
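As a hedged sketch, the countdown could use MPI_Fetch_and_op with the same the_window, counter_process, and fence synchronization as above (variable names are illustrative):

// sketch: atomically add the decrement and read the previous value
int decrement = -1, old_value;
MPI_Win_fence(0,the_window);
MPI_Fetch_and_op
  ( &decrement, &old_value, MPI_INT,
    counter_process,0, MPI_SUM,
    the_window);
MPI_Win_fence(0,the_window);
// old_value holds the counter as it was just before this update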
12. Problem
One process writes into the window on process 1 under an exclusive lock:
if (rank == 0) {
  MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
  MPI_Put(outbuf, n, MPI_INT, 1, 0, n, MPI_INT, win);
  MPI_Win_unlock(1, win);
}
while the remaining process spins until the others have performed
their update.
Use an atomic operation for the latter process to read out the shared value.
Can you replace the exclusive lock with a shared one?
As exercise 1, but now use a shared lock: all processes acquire the lock
simultaneously and keep it as long as needed.
The problem here is that coherence between window buffers and local
variables is no longer forced by a fence or by releasing a lock. Use
MPI_Win_flush_local to force coherence between the window (on another process)
and the local variable filled by MPI_Fetch_and_op; see the sketch below.
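A minimal sketch of that combination, assuming a window win hosted on rank counter_process (variable names are illustrative, not the course solution):

int decrement = -1, readout;
// shared lock: other processes can hold the lock at the same time
MPI_Win_lock(MPI_LOCK_SHARED, counter_process, 0, win);
MPI_Fetch_and_op
  ( &decrement, &readout, MPI_INT,
    counter_process, 0, MPI_SUM, win );
// complete the operation locally so that readout is valid,
// without giving up the lock
MPI_Win_flush_local(counter_process, win);
/* ... use readout ... */
MPI_Win_unlock(counter_process, win);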
Shared memory
Myth:
MPI processes use network calls, whereas OpenMP threads access
memory directly; therefore OpenMP is more efficient for shared
memory.
Truth:
MPI implementations use copy operations when possible, whereas
OpenMP has thread overhead and affinity/coherence problems.
Main problem with MPI on shared memory: data duplication.
MPI uses optimizations for shared memory: a copy instead of a socket call.
One-sided communication offers ‘fake shared memory’: yes, you can access another
process’ data, but only through function calls.
MPI-3 shared memory gives you a pointer to another process’ memory,
if that process is on the same shared memory node.
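A hedged sketch of how such a node window could be set up (names like node_comm are illustrative; the course code may differ): split off the processes that share memory, then let each allocate its part of a shared window.

MPI_Comm node_comm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                    0, MPI_INFO_NULL, &node_comm);
double *my_addr; MPI_Win node_window;
// every on-node process contributes one double to the shared window
MPI_Win_allocate_shared
  ( sizeof(double), sizeof(double), MPI_INFO_NULL,
    node_comm, &my_addr, &node_window );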
Use MPI_Win_shared_query:
MPI_Aint window_size0; int window_unit; double *win0_addr;
MPI_Win_shared_query
  ( node_window,0,
    &window_size0,&window_unit, &win0_addr );
Sample output of two runs, showing each process’ distance to the window
segment of process zero:
Distance 1 to zero: 8       Distance 1 to zero: 4096
Distance 2 to zero: 16      Distance 2 to zero: 8192
(In one run the segments are contiguous, 8 bytes apart; in the other they
are page-aligned, 4096 bytes apart.)
Advanced collectives
y ← Ax + (x^t x) y
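This update combines an inner product (a reduction) with a matrix-vector product that does not depend on it, so the reduction can be overlapped with local work. A hedged sketch with MPI_Iallreduce (the local_* helper functions are illustrative placeholders):

double local_dot, global_dot;
MPI_Request request;
local_dot = local_inner_product(x,x,nlocal);   // local part of x^t x
MPI_Iallreduce( &local_dot,&global_dot,1,MPI_DOUBLE,MPI_SUM,
                comm,&request );
local_matvec(A,x,Ax,nlocal);                   // compute Ax meanwhile
MPI_Wait( &request,MPI_STATUS_IGNORE );
local_update(y,Ax,global_dot,nlocal);          // y := Ax + (x^t x) y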
Blocking and non-blocking don’t match: either all processes call the
non-blocking or all call the blocking one. Thus the following code is
incorrect:
if (rank==root)
  MPI_Reduce( &x /* ... */ root,comm );
else
  MPI_Ireduce( &x /* ... */ root,comm,&req);
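For contrast, a minimal correct sketch where every process makes the same non-blocking call, reusing root and comm from above (buffer and count are illustrative):

double x=1., xsum;
MPI_Request req;
MPI_Ireduce( &x,&xsum,1,MPI_DOUBLE,MPI_SUM,root,comm,&req );
MPI_Wait( &req,MPI_STATUS_IGNORE );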
// powerpersist.c
double localnorm,globalnorm=1.;
MPI_Request reduce_request;
// set up the reduction once, before the iteration
MPI_Allreduce_init
  ( &localnorm,&globalnorm,1,MPI_DOUBLE,MPI_SUM,
    comm,MPI_INFO_NULL,&reduce_request);
for (int it=0; it<10; it++) {
  matmult(indata,outdata,buffersize);
  localnorm = localsum(outdata,buffersize);
  // start and complete the persistent reduction in each iteration
  MPI_Start( &reduce_request );
  MPI_Wait( &reduce_request,MPI_STATUS_IGNORE );
  scale(outdata,indata,buffersize,1./sqrt(globalnorm));
}
MPI_Request_free( &reduce_request );
Both are request-based.
Non-blocking is ‘ad hoc’: buffer info is not known before the collective
call.
Persistent allows ‘planning ahead’: management of internal buffers
and such.
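For comparison, a hedged sketch of the same iteration with an ad-hoc non-blocking reduction, reusing the helper names from the code above: a fresh request is created every time.

for (int it=0; it<10; it++) {
  MPI_Request req;
  matmult(indata,outdata,buffersize);
  localnorm = localsum(outdata,buffersize);
  MPI_Iallreduce( &localnorm,&globalnorm,1,MPI_DOUBLE,MPI_SUM,
                  comm,&req );
  MPI_Wait( &req,MPI_STATUS_IGNORE );
  scale(outdata,indata,buffersize,1./sqrt(globalnorm));
}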
Process topologies
Cartesian topology
MPI-1 Graph topology
MPI-3 Graph topology
Commands learned:
Cartesian topology (see the sketch below);
Graph topology, globally specified (MPI-1): not scalable, do not use!
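A hedged sketch of the Cartesian calls in question (illustrative, not the course's exact code):

int dims[2] = {0,0}, periods[2] = {0,0};
MPI_Dims_create(nprocs,2,dims);          // factor nprocs into a 2D grid
MPI_Comm cart_comm;
MPI_Cart_create(comm,2,dims,periods,0,&cart_comm);
int left,right;
MPI_Cart_shift(cart_comm,0,1,&left,&right);  // neighbors in dimension 0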
The middle node is blocked because all its targets are already receiving,
or a channel is occupied: one missed turn.
int MPI_Neighbor_allgather
  (const void *sendbuf, int sendcount, MPI_Datatype sendtype,
   void *recvbuf, int recvcount, MPI_Datatype recvtype,
   MPI_Comm comm)
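A hedged usage sketch (illustrative names, not the course's code): each process declares its two ring neighbors as graph edges, then gathers one integer from each of them.

int nbrs[2] = { (procno-1+nprocs)%nprocs, (procno+1)%nprocs };
MPI_Comm nbr_comm;
MPI_Dist_graph_create_adjacent
  ( comm, 2,nbrs,MPI_UNWEIGHTED, 2,nbrs,MPI_UNWEIGHTED,
    MPI_INFO_NULL, 0 /*reorder*/, &nbr_comm );
int mydata = procno, gathered[2];
MPI_Neighbor_allgather
  ( &mydata,1,MPI_INT, gathered,1,MPI_INT, nbr_comm );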
Heterogeneous: MPI_Neighbor_alltoallw.
Two approaches to asynchronous neighborhood collectives:
Non-blocking: MPI_Ineighbor_allgather and such;
Persistent: MPI_Neighbor_allgather_init,
MPI_Neighbor_allgatherv_init.
Version 4 adds yet more features for exascale, and more flexible process
management.
Fortran bindings
The Fortran interface to MPI had some defects. With Fortran 2008 these
have largely been repaired.
New module: use mpi_f08
(Q for the alert student: do you see anything halfway remarkable about
that index?)
But with an int count you can not tell MPI how big the buffer really is:
MPI_Send(buffer,bigsize,MPI_DOUBLE,...) // WRONG if bigsize overflows int
In C there is the type MPI_Count; in Fortran:
Integer(kind=MPI_COUNT_KIND) :: count
This count type is large enough to hold:
an int;
an MPI_Aint, used in one-sided communication;
an MPI_Offset, used in file I/O.
However, this type could not be used in MPI-3 to describe send buffers.
// pingpongbig.c
assert( sizeof(MPI_Count)>4 );
for ( int power=3; power<=10; power++) {
  MPI_Count length=pow(10,power);
  buffer = (double*)malloc( length*sizeof(double) );
  // the _c variants take an MPI_Count for the element count
  MPI_Ssend_c
    (buffer,length,MPI_DOUBLE,
     processB,0,comm);
  MPI_Recv_c
    (buffer,length,MPI_DOUBLE,
     processB,0,comm,MPI_STATUS_IGNORE);
  free(buffer);
}
C:
MPI_Count count;
MPI_Get_count_c( &status,MPI_INT, &count );
MPI_Get_elements_c( &status,MPI_INT, &count );
Fortran:
Integer(kind=MPI_COUNT_KIND) :: count
call MPI_Get_count( status,MPI_INTEGER,count )
call MPI_Get_elements( status,MPI_INTEGER,count )
Partitioned communication
Hybrid scenario:
multiple threads contribute to one large message
Partitioned send/recv:
the contributions can be declared/tested
// partition.c
int bufsize = nparts*SIZE;
int *partitions = (int*)malloc((nparts+1)*sizeof(int));
for (int ip=0; ip<=nparts; ip++)
  partitions[ip] = ip*SIZE;
if (procno==src) {
  double *sendbuffer = (double*)malloc(bufsize*sizeof(double));
  MPI_Request send_request;
  MPI_Psend_init
    (sendbuffer,nparts,SIZE,MPI_DOUBLE,tgt,0,
     comm,MPI_INFO_NULL,&send_request);
  for (int it=0; it<ITERATIONS; it++) {
    MPI_Start(&send_request);
    for (int ip=0; ip<nparts; ip++) {
      fill_buffer(sendbuffer,partitions[ip],partitions[ip+1],ip);
      // declare this partition as ready to be sent
      MPI_Pready(ip,send_request);
    }
    MPI_Wait(&send_request,MPI_STATUS_IGNORE);
  }
  MPI_Request_free(&send_request);
}
On the receiving side, use
MPI_Parrived(recv_request,ipart,&flag);
to test whether a particular partition has arrived.
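A hedged sketch of a matching receive side, mirroring the variable names above (details are illustrative, not the course code):

if (procno==tgt) {
  double *recvbuffer = (double*)malloc(bufsize*sizeof(double));
  MPI_Request recv_request;
  MPI_Precv_init
    (recvbuffer,nparts,SIZE,MPI_DOUBLE,src,0,
     comm,MPI_INFO_NULL,&recv_request);
  MPI_Start(&recv_request);
  for (int ipart=0; ipart<nparts; ipart++) {
    int flag=0;
    // poll until this partition has arrived, then process it
    while (!flag)
      MPI_Parrived(recv_request,ipart,&flag);
  }
  MPI_Wait(&recv_request,MPI_STATUS_IGNORE);
  MPI_Request_free(&recv_request);
}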
Sessions model
Create a session;
a session has multiple ‘process sets’;
from a process set you make a communicator;
potentially create multiple sessions in one program run.
You can not mix objects from multiple simultaneous sessions.
// session.c
MPI_Info session_request_info = MPI_INFO_NULL;
MPI_Info_create(&session_request_info);
char thread_key[] = "mpi_thread_support_level";
MPI_Info_set(session_request_info,
             thread_key,"MPI_THREAD_MULTIPLE");
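A hedged sketch of how this continues, following the steps in the list above (the string tag is an arbitrary illustrative value):

MPI_Session the_session;
MPI_Session_init( session_request_info,MPI_ERRORS_ARE_FATAL,
                  &the_session );
// derive a group from the predefined "mpi://WORLD" process set
MPI_Group world_group;
MPI_Group_from_session_pset
  ( the_session,"mpi://WORLD",&world_group );
// make a communicator from that group
MPI_Comm session_comm;
MPI_Comm_create_from_group
  ( world_group,"example-tag",MPI_INFO_NULL,
    MPI_ERRORS_ARE_FATAL,&session_comm );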
// sessionmulti.c
MPI_Info info1 = MPI_INFO_NULL, info2 = MPI_INFO_NULL;
char thread_key[] = "mpi_thread_support_level";
MPI_Info_create(&info1); MPI_Info_create(&info2);
MPI_Info_set(info1,thread_key,"MPI_THREAD_SINGLE");
MPI_Info_set(info2,thread_key,"MPI_THREAD_MULTIPLE");
MPI_Session session1,session2;
MPI_Session_init( info1,MPI_ERRORS_ARE_FATAL,&session1 );
MPI_Session_init( info2,MPI_ERRORS_ARE_FATAL,&session2 );
Library lib1,lib2;
MPI_Init(0,0);
MPI_Comm world = MPI_COMM_WORLD;
int procno,nprocs;
MPI_Comm_rank(world,&procno);
MPI_Comm_size(world,&nprocs);
auto sum1 = lib1.compute(procno);
auto sum2 = lib2.compute(procno+1);
MPI-4: types
While the C API to MPI is usable from C++, it feels very unidiomatic for
that language. Message Passing Layer (MPL) is a modern C++11
interface to MPI. It is both idiomatic and elegant, simplifying many calling
sequences. It is very low overhead.
Basics
The reduction function needs to be compatible with the type of the
buffer:
comm_world.reduce(lcm<int>(), 0, v, result);
MPL uses a default value for the tag, and it can deduce the type of the
buffer. Sending a scalar becomes:
// sendscalar.cxx
if (comm_world.rank()==0) {
  double pi=3.14;
  comm_world.send(pi, 1); // send to rank 1
  cout << "sent: " << pi << '\n';
} else if (comm_world.rank()==1) {
  double pi=0;
  comm_world.recv(pi, 0); // receive from rank 0
  cout << "got : " << pi << '\n';
}
This means that the normal sequence of first declaring, and then filling in,
the request variable is not possible.
MPL implementation note: the wait call always returns a
status object; not assigning it means that the destructor is called.
102. Request pools
You can not declare a pool of a fixed size and assign elements.
// vector.cxx
vector<double> source(stride*count);
if (procno==sender) {
  mpl::strided_vector_layout<double>
    newvectortype(count,1,stride);
  comm_world.send
    (source.data(),newvectortype,the_other);
}
Summary
Appendix: exercises
(Classroom exercise) Each student holds a piece of paper in the right hand
– keep your left hand behind your back – and we want to execute a shift:
pass your paper to your right neighbor, and accept a paper from your left
neighbor.
Including boundary conditions for first and last process, that becomes the
following program:
1. If you are not the rightmost student, turn to the right and give the
paper to your right neighbor.
2. If you are not the leftmost student, turn to your left and accept the
paper from your left neighbor.
Check that the rank in the row communicator is the column number, and
the other way around.
Run your code on different numbers of processes, for instance a number of
rows and columns that is a power of 2, or that is a prime number. This is
one occasion where you could use ibrun -np 9; normally you would
never put a processor count on ibrun.