MPI Tutorial
Purpose
Other Tutorials
MPI Programming Model
Purpose
This tutorial will help you install and use MPI.NET, a .NET library that enables the creation of
high-performance parallel applications that can be deployed on multi-threaded workstations and
Windows clusters. MPI.NET provides access to the Message Passing Interface (MPI) in C# and
all of the other .NET languages. MPI is a standard for message-passing programs that is widely
implemented and used for high-performance parallel programs that execute on clusters and
supercomputers.
By the end of this tutorial, you should be able to install MPI.NET, build and run MPI programs with mpiexec, and write C# programs that use MPI.NET's point-to-point and collective communication facilities.
Other Tutorials
This tutorial is written for online reading and uses C# for all of its examples. Please see the main
documentation page for tutorials formatted for printing/offline reading and tutorials using other
languages (e.g., Python).
MPI Programming Model
MPI programs typically follow the Single Program, Multiple Data (SPMD) model of parallel programming: every process runs the same program, but each process operates only on the data that is available locally (within that process's local memory), communicating with the other
processes in the parallel program at the boundaries of the data. For example, consider a simple
program that computes the sum of all of the elements in an array. The sequential program would
loop through the array summing all of the values to produce a result. In a SPMD parallel
program, the array would be broken up into several different pieces (one per process), and each
process would sum the values in its local array (using the same code that the sequential program
would have used). Then, the processes in the parallel program would communicate to combine
their local sums into a global sum for the array.
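As a preview of where this tutorial is headed, here is a minimal sketch of that parallel summation written with MPI.NET. The Reduce call that combines the local sums is explained in the collective-communication section later in this tutorial, and the local data here is simply made up for illustration.
using System;
using MPI;
class ArraySum
{
    static void Main(string[] args)
    {
        using (new MPI.Environment(ref args))
        {
            Intracommunicator comm = Communicator.world;
            // In a real program, each process would load or be sent its own
            // piece of the array; here we just make up some local data.
            int[] localValues = new int[1000];
            for (int i = 0; i < localValues.Length; ++i)
                localValues[i] = i;
            // Each process sums its local piece using ordinary sequential code.
            int localSum = 0;
            foreach (int value in localValues)
                localSum += value;
            // The processes then combine their local sums into a global sum.
            int globalSum = comm.Reduce(localSum, Operation<int>.Add, 0);
            if (comm.Rank == 0)
                Console.WriteLine("The global sum is " + globalSum);
        }
    }
}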
MPI supports the SPMD model by allowing the user to easily launch the same program across
many different machines (nodes) with a single command. Initially, each of the processes is
identical, with one distinguishing characteristic: each process is assigned a rank, which uniquely
identifies that process. The ranks of MPI processes are integer values from 0 to P-1, where P is
the number of processes launched as part of the MPI program. MPI processes can query their
rank, allowing different processes in the MPI program to have different behavior, and exchange
messages with other processes in the same job via their ranks.
Prerequisites
Install MPI.NET SDK
Ping? Pong! Running an MPI program
Prerequisites
To develop parallel programs using MPI.NET, you will need several other tools. Note that you
do not need to have a Windows cluster or even a multi-core/multi-processor workstation to
develop MPI programs: any desktop machine that can run Windows XP can be used to develop
MPI programs with MPI.NET.
Microsoft Visual Studio 2005 (or newer), including Microsoft Visual C#: we will be
writing all of our examples in C#, although MPI.NET can be used from any .NET
language.
MS-MPI: MPI.NET is built on Microsoft's implementation of the Message Passing
Interface. There are actually two different ways to get MS-MPI (you only need to do one
of these):
o HPC Pack 2008 SDK or Microsoft Compute Cluster Pack SDK: includes MS-MPI and the various headers that one needs if writing MPI programs in C or C++
without MPI.NET. Recommended for most users, because it installs on Windows
XP and can be used to develop MPI.NET programs.
o Windows HPC Server 2008 or Microsoft Compute Cluster Server 2003: provides
support for deploying MPI.NET applications on Windows clusters. However, for
development purposes it is generally best to use one of the SDKs mentioned
above.
Windows Installer: most Windows users will already have this program, which is used to
install programs on Microsoft Windows.
Here, we have executed the MPI program PingPong with a single process, which will always be
assigned rank 0. The process has displayed the name of the computer it is running on (our
computer is named "jeltz") and exited. When you run this program, you might get a warning
from Windows Firewall, because PingPong is initiating network communications. Just select
"Unblock" to let the program execute.
To make PingPong more interesting, we will instruct MPI to execute 8 separate processes in the
PingPong job, all coordinating via MPI. Each of the processes will execute the PingPong
program, but because they have different MPI ranks (integers 0 through 7, inclusive), each will
act slightly differently. To run MPI programs with multiple processes, we use the mpiexec
program provided by the Microsoft Compute Cluster Pack or its SDK, as follows (all on a single
command line):
C:\Program Files\MPI.NET>"C:\Program Files\Microsoft Compute Cluster
Pack\Bin\mpiexec.exe" -n 8 PingPong.exe
Rank 0 is alive and running on jeltz
Pinging process with rank 1... Pong!
Rank 1 is alive and running on jeltz
Pinging process with rank 2... Pong!
Rank 2 is alive and running on jeltz
Pinging process with rank 3... Pong!
Rank 3 is alive and running on jeltz
Pinging process with rank 4... Pong!
Rank 4 is alive and running on jeltz
Pinging process with rank 5... Pong!
Rank 5 is alive and running on jeltz
Pinging process with rank 6... Pong!
Rank 6 is alive and running on jeltz
Pinging process with rank 7... Pong!
Rank 7 is alive and running on jeltz
That's it! The mpiexec program launched 8 separate processes that are working together as a
single MPI program. The -n 8 argument instructs mpiexec to start 8 processes that will all
communicate via MPI (you can specify any number of processes here). In the PingPong
program, the process with rank 0 will send a "ping" to each of the other processes, and report the
name of the computer running that process back to the user. Since we haven't told mpiexec to
run on different computers, all 8 processes are running on our workstation; however, the same
program would work even if the 8 processes were running on different machines.
If PingPong ran correctly on your system, your installation of MPI.NET is complete and you're
ready to move on to programming in MPI.NET. If you are going to be developing and running
MPI programs, you will probably want to add the Compute Cluster Pack's Bin directory to your
PATH environment variable, so that you can run mpiexec directly. From now on, we're going to
assume that you have done so and will just use mpiexec in our examples without providing its
path.
Once you've created your project, you need to add a reference to the MPI.NET assembly in
Visual Studio. This will allow your program to use MPI.NET's facilities, and will also give you
on-line help for MPI.NET's classes and functions. In the Solution Explorer, right click on
"References" and select "Add Reference...":
Next, scroll down to select the "Message Passing Interface" item from the list of components
under the .NET tab, then click "OK" to add a reference to the MPI.NET assembly.
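With the reference in place, a minimal MPI.NET program has the following overall shape. This sketch assumes a project and class named MPIHello (to match the executable we run below); everything MPI-related happens inside the using statement.
using System;
using MPI;
class MPIHello
{
    static void Main(string[] args)
    {
        using (new MPI.Environment(ref args))
        {
            // MPI calls go here.
        }
    }
}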
The entirety of an MPI program should be contained within the using statement, which
guarantees that the MPI environment will be properly finalized (via
MPI.Environment.Dispose) before the program exits. All valid MPI programs must both
initialize and finalize the MPI environment. We pass in a reference to our command-line
arguments, args, because MPI implementations are permitted to use special command-line
arguments to pass state information in to the MPI initialization routines (although few MPI
implementations actually do this). In theory, MPI could remove some MPI-specific arguments
from args, but in practice args will be untouched.
Now that we have the MPI environment initialized, we can write a simple program that prints out
a string from each process. Inside the using statement, add the line:
Console.WriteLine("Hello, World! from rank " + Communicator.world.Rank
+ " (running on " + MPI.Environment.ProcessorName + ")");
Each MPI process will execute this code independently (and concurrently), and each will likely
produce slightly different results. For example, MPI.Environment.ProcessorName returns the
name of the computer on which a process is running, which could differ from one MPI process to
the next (if we're running our program on a cluster). Similarly, we're printing out the rank of
each process via Communicator.world.Rank. We'll talk about communicators a bit more later.
mpiexec -n 8 MPIHello.exe
Hello, World! from rank 0 (running on jeltz)
Hello, World! from rank 4 (running on jeltz)
Hello, World! from rank 1 (running on jeltz)
Hello, World! from rank 5 (running on jeltz)
Hello, World! from rank 2 (running on jeltz)
Hello, World! from rank 6 (running on jeltz)
Hello, World! from rank 3 (running on jeltz)
Hello, World! from rank 7 (running on jeltz)
Notice that we have 8 different lines of output, one for each of the 8 MPI processes we started as
part of our MPI program. Each will output its rank (from 0 to 7) and the name of the processor or
machine it is running on. The output you receive from running this program will be slightly
different from the output shown here, and will probably differ from one invocation to the next.
Since the processes are running concurrently, we don't know in what order the processes will
finish the call to WriteLine and write that output to the screen. To actually enforce some
ordering, the processes would have to communicate.
MPI Communicators
All MPI communication takes place within a communicator, which represents a group of processes that can exchange messages with one another; each process is identified by its rank within that communicator. The predefined world communicator, available in MPI.NET as Communicator.world, contains all of the processes in the MPI job.
Point-to-Point Communications
Ring Around the Network
Data Types and Serialization
Point-to-Point Communication
Point-to-point communication is the most basic form of communication in MPI, allowing a
program to send a message from one process to another over a given communicator. Each
message has a source and target process (identified by their ranks within the communicator), an
integral tag that identifies the kind of message, and a payload containing arbitrary data. Tags will
be discussed in more detail later.
There are two kinds of communication for sending and receiving messages via MPI.NET's point-to-point facilities: blocking and non-blocking. The blocking point-to-point operations will wait
until a communication has completed on its local processor before continuing. For example, a
blocking Send operation will not return until the message has entered into MPI's internal buffers
to be transmitted, while a blocking Receive operation will wait until a message has been
received and completely decoded before returning. MPI.NET's non-blocking point-to-point
operations, on the other hand, will initiate a communication without waiting for that
communication to be completed. Instead, a Request object, which can be used to query,
complete, or cancel the communication, will be returned. For our initial examples, we will use
blocking communication.
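Before moving on, here is a minimal sketch of what the non-blocking style can look like, placed inside the usual using statement. It assumes MPI.NET's ImmediateSend and ImmediateReceive methods and the Wait and GetValue members of the request objects they return; check the MPI.NET reference documentation for the exact signatures.
Intracommunicator comm = Communicator.world;
if (comm.Rank == 0)
{
    // Start the send and return immediately, without waiting for completion.
    Request sendRequest = comm.ImmediateSend("Hello from rank 0", 1, 0);
    // ... other useful work can overlap with the communication here ...
    sendRequest.Wait();   // complete the send before reusing the data
}
else if (comm.Rank == 1)
{
    // Start the receive and return immediately.
    ReceiveRequest receiveRequest = comm.ImmediateReceive<string>(0, 0);
    // ... other useful work can overlap with the communication here ...
    receiveRequest.Wait();                         // wait for the message to arrive
    string msg = (string)receiveRequest.GetValue();
    Console.WriteLine("Rank 1 received: " + msg);
}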
To implement our ring-communication application, we start with the typical skeleton of an MPI
program, and give ourselves an easy way to access the world communicator (via the variable
comm). Then, since we have decided that process 0 will initiate the message, we give rank 0 a
different code path from the other processes in the MPI program.
using System;
using MPI;
class Ring
{
static void Main(string[] args)
{
using (new MPI.Environment(ref args))
{
Intracommunicator comm = Communicator.world;
if (comm.Rank == 0)
{
// program for rank 0
}
else // not rank 0
{
// program for all other ranks
}
}
}
}
This pattern of giving one of the processes (which is often called the "root", and is typically rank
0) a slightly different code path than all of the other processes is relatively common in MPI
programs, which often need to perform some coordination or interaction with the user.
Rank 0 will be responsible for initiating the communication, by sending a message to rank 1. The
code below initiates a (blocking) send of a piece of data. The three parameters to the Send
routine are, in order:
The data to be transmitted with the message. In this case, we're sending the string
"Rosie".
The rank of the destination process within the communicator. In this case, we're sending
the message to rank 1. (We are therefore assuming that this program is going to run with
more than one process!)
The tag of the message, which will be used by the receiver to distinguish this message
from other kinds of messages. We'll just use tag 0, since there is only one kind of
message in our program.
if (comm.Rank == 0)
{
// program for rank 0
comm.Send("Rosie", 1, 0);
// receive the final message
}
Now that we have initiated the message, we need to write code for each of the other processes.
These processes will wait until they receive a message from their predecessor, print the message,
then send a message on to their successor.
else // not rank 0
{
// program for all other ranks
string msg = comm.Receive<string>(comm.Rank - 1, 0);
Console.WriteLine("Rank " + comm.Rank + " received message \"" + msg +
"\".");
comm.Send(msg + ", " + comm.Rank, (comm.Rank + 1) % comm.Size, 0);
}
The Receive call in this example states that we will be receiving a string from the processor with
rank comm.Rank - 1 (our predecessor in the ring) and tag 0. This receive will match any
message sent from that rank on tag zero; if that message does not contain a string, the program
will fail. However, since the only Send operations in our program send strings with tag 0, we will
not have a problem. Once a process has received a string from its predecessor, it will print that to
the console and send another message on to its successor in the ring. This Send operation is
much like rank 0's Send operation: most importantly, it sends a string over tag 0. Note that each
process will add its own rank to the message string, so that we get an idea of the path that the
message took.
Finally, we return to the special-case code for rank 0. When the last process in the ring finally
sends its result back to rank 0, we will need to receive that result. The receive for rank 0 is
similar to the receive for all of the other processes, although here we use the special value
Communicator.anySource for the "source" process of the receive. anySource allows the
Receive operation to match a message with the appropriate tag, regardless of which rank sent
the message. The corresponding value for the tag argument, Communicator.anyTag, allows a
Receive to match a message with any tag.
if (comm.Rank == 0)
{
// program for rank 0
comm.Send("Rosie", 1, 0);
// receive the final message
string msg = comm.Receive<string>(Communicator.anySource, 0);
Console.WriteLine("Rank " + comm.Rank + " received message \"" + msg +
"\".");
}
We can now go ahead and compile this program, then run it with 8 processes to mimic the
communication ring in the figure at the beginning of this section:
C:\Ring\bin\Debug>mpiexec -n 8 Ring.exe
Rank 1 received message "Rosie".
Rank 2 received message "Rosie, 1".
Rank 3 received message "Rosie, 1, 2".
Rank 4 received message "Rosie, 1, 2, 3".
Rank 5 received message "Rosie, 1, 2, 3, 4".
Rank 6 received message "Rosie, 1, 2, 3, 4, 5".
Rank 7 received message "Rosie, 1, 2, 3, 4, 5, 6".
Rank 0 received message "Rosie, 1, 2, 3, 4, 5, 6, 7".
In theory, even though the processes are each printing their respective messages in order, it is
possible that the lines in the output could be printed in a different order (or even produce some
unreadable interleaving of characters), because each of the MPI processes has its own "console",
all of which are forwarded back to your command prompt. For simple MPI programs, however,
writing to the console often suffices.
At this point, we have completed our "ring" example, which passes a message around a ring of
two or more processes and prints the results. Now, we'll take a quick look at what kind of data can
be transmitted via MPI.
Data Types and Serialization
MPI.NET can transmit values of many different .NET data types, including the following:
Public Structures
C# structures with public visibility. For example, the following Point structure:
public struct Point
{
public float x;
public float y;
}
Serializable Classes
Any class that is serializable. A class can be made serializable by attaching the
Serializable attribute, as shown below; for more information, see Object Serialization
using C#.
[Serializable]
public class Employee
{
// ...
}
As mentioned before, MPI.NET transmits different data types in different ways. While most of
the details of value transmission are irrelevant to MPI users, there is a significant distinction
between the way that .NET value types are transmitted and the way that reference types are
transmitted. The differences between value types and reference types are discussed in some
detail in .NET: Type Fundamentals. For MPI.NET, value types, which include primitive types
and structures, are always transmitted in a single message, and provide the best performance for
message-passing applications. Reference types, on the other hand, always need to be serialized
(because they refer to objects on the heap) and (typically) are split into several messages for
transmission. Both of these operations make the transmission of reference types significantly
slower than value types. However, reference types are often necessary for complicated data
structures, and provide one other benefit: unlike with value types, which require the data types at
the sender and receiver to match exactly, one can send an object for a derived class and receive it
via its base class, simplifying some programming tasks.
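As a sketch of that last point, suppose we define the following serializable classes (Shape and Circle are made-up names used only for illustration). The sender can then transmit a Circle while the receiver asks for a Shape:
[Serializable]
public class Shape
{
    // fields common to all shapes ...
}

[Serializable]
public class Circle : Shape
{
    public double Radius;
}

// Inside the usual using statement, with comm referring to Communicator.world:
if (comm.Rank == 0)
{
    Circle circle = new Circle();
    circle.Radius = 1.0;
    comm.Send(circle, 1, 0);                  // send the derived object
}
else if (comm.Rank == 1)
{
    Shape shape = comm.Receive<Shape>(0, 0);  // receive it via its base class
}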
MPI.NET's point-to-point operations also provide support for arrays. As with transmitting
objects, arrays are transmitted in different ways depending on whether the element type of the
array is a value type or a reference type. In both cases, however, when you are receiving an array
you must provide an array with at least as many elements as the sender has sent. Note that we
provide the array to receive into as our last argument to Receive, using the ref keyword to
denote that the routine will modify the array directly (rather than allocating a new array). For
example:
if (comm.Rank == 0)
{
int[] values = new int [5];
comm.Send(values, 1, 0);
}
else if (comm.Rank == 1)
{
int[] values = new int [10];
comm.Receive(0, 0, ref values); // okay: an array of 10 integers has enough space to receive 5 integers
}
MPI.NET can transmit most kinds of data types used in C# and .NET programs. The most
important rule with sending and receiving messages, however, is that the data types provided by
the sender and receiver must match directly (for value types) or have a derived-base relationship
(for reference types).
Collective Communication
Barrier: Marching Computations
All-to-one: Gathering Data
One-to-all: Spreading the Message
All-to-all: Something for Everyone
Combining Results with Parallel Reduction
Collective Communication
Collective communication provides a more structured alternative to point-to-point
communication. With collective communications, all of the processes within a communicator
collaborate on a single communication operation that fits one of several common communication
patterns used in message-passing applications. Collective operations include simple barriers,
one-to-all, all-to-one, and all-to-all communications, and parallel reduction operations that
combine the values provided by each of the processes in the communication.
Although it is possible to express parallel programs entirely through point-to-point operations
(some even call send and receive the "assembly language" of distributed-memory parallel
programming), collectives provide several advantages for writing parallel programs. For these
reasons, it is generally preferred to use collectives whenever possible, falling back to point-to-point operations when no suitable collective exists.
Code Readability/Maintainability
It is often easier to write and reason about programs that use collective communication
than the equivalent program using point-to-point communication. Collectives express the
intent of a communication better (e.g., a Scatter operation is clearly distributing data
from one process to all of the other processes), and there are often far fewer collective
operations needed to accomplish a task than point-to-point messages (e.g., a single all-to-all operation instead of N² point-to-point operations), making it easier to debug programs
using collectives.
Performance
MPI implementations typically contain optimized algorithms for collective operations
that take advantage of knowledge of the network topology and hardware, even taking
advantage of hardware-based implementations of some collective operations. These
optimizations are hard to implement directly over point-to-point, without the knowledge
already available in the MPI implementation itself. Therefore, using collective operations
can help improve the performance of parallel programs and make that performance more
portable to other clusters with different configurations.
In an MPI program, the various processes perform their local computations without regard to the
behavior of the other processes in the program, except when the processes are waiting for some
inter-process communication to complete. In many parallel programs, all of the processes work
more-or-less independently, but we want to make sure that all of the processes are on the same
step at the same time. The Barrier collective operation is used for precisely this purpose.
When processes enter the barrier, they do not exit the barrier until all processes have entered the
barrier. Place barriers before or after a step of the computation that all processes need to perform
at the same time.
In the example program below, each of the iterations of the loop is completely synchronized, so
that every process is on the same iteration at the same time.
using System;
using MPI;
class Barrier
{
static void Main(string[] args)
{
using (new MPI.Environment(ref args))
{
Intracommunicator comm = Communicator.world;
for (int i = 1; i <= 5; ++i)
{
comm.Barrier();
if (comm.Rank == 0)
Console.WriteLine("Everyone is on step " + i + ".");
}
}
}
}
Executing this program with any number of processes will produce the following output (here,
we use 8 processes).
C:\Barrier\bin\Debug>mpiexec -n 8 Barrier.exe
Everyone is on step 1.
Everyone is on step 2.
Everyone is on step 3.
Everyone is on step 4.
Everyone is on step 5.
All-to-one: Gathering Data
The Gather collective collects a value from every process in the communicator into an array on a single "root" process. For example, the following program gathers the name of the host on which each process is running and prints the sorted list at rank 0:
using System;
using MPI;
class Hostnames
{
static void Main(string[] args)
{
using (new MPI.Environment(ref args))
{
Intracommunicator comm = Communicator.world;
string[] hostnames = comm.Gather(MPI.Environment.ProcessorName,
0);
if (comm.Rank == 0)
{
Array.Sort(hostnames);
foreach(string host in hostnames)
Console.WriteLine(host);
}
}
}
}
In the call to Gather, each process provides a value (in this case, the string produced by reading
the ProcessorName property) to the Gather operation, along with the rank of the "root" node
(here, process zero). The Gather operation will return an array of values to the root node, where
the ith value in the array corresponds to the value provided by the process with rank i. All other
processes receive a null array.
To gather all of the data from all of the nodes, use the Allgather collective. Allgather is
similar to Gather, with two differences: first, there is no parameter identifying the "root"
process, and second, all processes receive the same array containing the contributions from every
process. An Allgather is, therefore, the same as a Gather followed by a Broadcast, described
below.
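For example, the Gather call in the host-name program above could be replaced with something like the following sketch, assuming MPI.NET's Allgather method; afterwards every process, not just rank 0, holds the complete array:
// Every process contributes its processor name and receives the full array.
string[] hostnames = comm.Allgather(MPI.Environment.ProcessorName);
Array.Sort(hostnames);
Console.WriteLine("Rank " + comm.Rank + " sees " + hostnames.Length + " host name(s).");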
One-to-all: Spreading the Message
The Broadcast collective sends a value from one "root" process to every other process in the
communicator. The Broadcast operation requires only two arguments; the second, familiar argument is the rank
of the root process, which will supply the value. The first argument contains the value to send (at
the root) or the place in which the received value will be stored (for every process). The pattern
shown in the sketch below is quite common for Broadcast: all processes define the same variable, but
only the root process gives it a meaningful value. Then the processes coordinate to broadcast the
root's value to every process, and all processes follow the same code path to handle the data.
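A minimal sketch of this pattern, placed inside the usual using statement and using the two-argument Broadcast described above:
Intracommunicator comm = Communicator.world;
string message = null;                  // every process defines the same variable
if (comm.Rank == 0)
    message = "Hello from the root!";   // only the root gives it a meaningful value
comm.Broadcast(ref message, 0);         // afterwards, every process holds the root's value
Console.WriteLine("Rank " + comm.Rank + " received \"" + message + "\"");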
The Scatter collective, like Broadcast, broadcasts values from a root process to every other
process. Scatter, however, will broadcast different values to each of the processes, allowing the
root to hand out different tasks to each of the other processes. The root process provides an array
of values, in which the ith value will be sent to the process with rank i. All of the processes then
return their value from the Scatter operation.
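Here is a sketch of how that might look, assuming MPI.NET's Scatter method, which takes the root's array and the root's rank (the non-root processes can simply pass null for the array):
Intracommunicator comm = Communicator.world;
string[] tasks = null;
if (comm.Rank == 0)
{
    // Only the root builds the array; element i will be sent to rank i.
    tasks = new string[comm.Size];
    for (int dest = 0; dest < comm.Size; ++dest)
        tasks[dest] = "Task for rank " + dest;
}
string myTask = comm.Scatter(tasks, 0);
Console.WriteLine("Rank " + comm.Rank + " received: " + myTask);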
All-to-all: Something for Everyone
The Alltoall collective exchanges data between every pair of processes: each process provides an
array whose ith element will be delivered to the process with rank i, and receives back an array
whose jth element was contributed by the process with rank j. Suppose, for example, that each
process fills its outgoing array with strings of the form "From <sender rank> to <destination
rank>." (a sketch of such a program follows the output below). When executed with 8 processes,
rank 1 will receive an array containing the following strings (in order):
From 0 to 1.
From 1 to 1.
From 2 to 1.
From 3 to 1.
From 4 to 1.
From 5 to 1.
From 6 to 1.
From 7 to 1.
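A sketch of a program that could produce output like the above, assuming MPI.NET's Alltoall method; each process passes an array whose ith element is destined for rank i and gets back an array of the values addressed to it:
Intracommunicator comm = Communicator.world;
// Element dest of this array will be delivered to the process with rank dest.
string[] outgoing = new string[comm.Size];
for (int dest = 0; dest < comm.Size; ++dest)
    outgoing[dest] = "From " + comm.Rank + " to " + dest + ".";
string[] incoming = comm.Alltoall(outgoing);
if (comm.Rank == 1)
    foreach (string msg in incoming)
        Console.WriteLine(msg);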
Combining Results with Parallel Reduction
A classic way to approximate pi is to throw darts at random points within a square and count how
many land inside the circle inscribed in that square; the fraction of darts that land inside the
circle approaches pi/4. When running this program, the more darts you throw, the better the approximation to pi. To
parallelize this program, we'll use MPI to run several processes, each of which will throw darts
independently. Once all of the processes have finished, we'll sum up the results (the total number
of darts that landed inside the circle on all processes) to compute pi. The complete code for the
parallel calculation of pi follows, but the most important line uses Reduce to sum the total
number of darts that landed in the circle across all of the processes:
int totalDartsInCircle = comm.Reduce(dartsInCircle, Operation<int>.Add, 0);
The three arguments to Reduce are the number of darts that landed in the circle locally, a
delegate Operation<int>.Add that sums integers, and the rank of the root process (here, 0).
Any other .NET delegate with a matching signature would also work, e.g.,
public static int AddInts(int x, int y) { return x + y; }
// ...
int totalDartsInCircle = comm.Reduce(dartsInCircle, AddInts, 0);
However, using the MPI.NET Operation class permits better optimizations within MPI.NET.
Without further delay, here is the complete MPI program for computing an approximation to pi
in parallel:
using System;
using MPI;
class Pi
{
static void Main(string[] args)
{
using (new MPI.Environment(ref args))
{
Intracommunicator comm = Communicator.world;
int dartsPerProcessor = 10000;
Random random = new Random(5 * comm.Rank);
int dartsInCircle = 0;
for (int i = 0; i < dartsPerProcessor; ++i)
{
double x = (random.NextDouble() - 0.5) * 2;
double y = (random.NextDouble() - 0.5) * 2;
if (x * x + y * y <= 1.0)
++dartsInCircle;
}
int totalDartsInCircle = comm.Reduce(dartsInCircle,
Operation<int>.Add, 0);
if (comm.Rank == 0)
Console.WriteLine("Pi is approximately {0:F15}.",
4*(double)totalDartsInCircle/(comm.Size*(double)dartsPerProcessor));
}
}
}