To get a feel for writing MPI programs, we'll focus on the simplest program imaginable, the standard Hello world program. The basic outline of an MPI program follows these general steps:
MPI has over 125 functions. However, a beginning programmer usually can make do with only six of these functions. These six functions are illustrated in our sample program.
First, we'll look at the actual calling formats used by MPI.
For C, the general format is
Note that case is important here. For example, MPI must be capitalized, as must be the first character after the underscore. Everything after that must be lower case. Rc is a return code, and is type integer.
C programs should include the file "mpi.h". This contains definitions for MPI constants and functions.
For Fortran, the general form is
Note that case is not important here. So, an equivalent form would be
Instead of the function returning with an error code, as in C, the Fortran versions of MPI routines usually have one additional parameter in the calling list, ierror, which is the return code.
Fortran programs should include 'mpif.h'. This contains definitions for MPI constants and functions.
The exceptions to the above formats are the timing routines (MPI_WTIME and MPI_WTICK) which are functions for both C and Fortran, and return double-precision reals.
As you look at the code below, note the various calls to MPI routines. Click on the name of each MPI routine to read a detailed description of that routine's purpose and syntax.
program hello
include 'mpif.h'
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
character(12) message
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
tag = 100
if(rank .eq. 0) then
message = 'Hello, world'
do i=1, size-1
call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
& MPI_COMM_WORLD, ierror)
enddo
else
call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
& MPI_COMM_WORLD, status, ierror)
endif
print*, 'node', rank, ':', message
call MPI_FINALIZE(ierror)
end
To summarize the program: This is a SPMD code, so copies of this program are running on multiple nodes. Each process initializes itself with MPI (MPI_INIT), determines the number of processes (MPI_COMM_SIZE), and learns its rank (MPI_COMM_RANK). Then one process (with rank 0) sends messages in a loop (MPI_SEND), setting the destination argument to the loop index to ensure that each of the other processes is sent one message. The remaining processes receive one message (MPI_RECV). All processes then print the message, and exit from MPI (MPI_FINALIZE).
MPI messages consist of two basic parts: the actual data that you want to send/receive, and an envelope of information that helps to route the data. There are usually three calling parameters in MPI message-passing calls that describe the data, and another three parameters that specify the routing:
Message = data(3 parameters) + envelope(3 parameters)
startbuf,count,datatype, dest,tag,comm
\ | / \ | /
\---DATA--/ ENVELOPE
Let's look at the data and envelope in more detail. We'll describe each parameter, and discuss
whether these must be coordinated between the sender and receiver.
The buffer (location in your program where data are to be sent from or stored to) is specified by three calling parameters:
The address where the data start. For example, this could be the start of an array in your program.
The number of elements (items) of data in the message. Note that this is elements, not bytes. This makes for portable code, since you don't have to worry about different representations of data types on different computers. The software implementation of MPI determines the number of bytes automatically.
The count specified by the receive call should be greater than or equal to the count specified by the send. If more data is sent than storage is available in the receive buffer, an error will occur.
The type of data to be transmitted. For example, this could be floating point. The datatype should be the same for the send and receive call. As an aside, an exception to this rule is the datatype MPI_PACKED, which is one method of handling mixed-type messages (the preferred method is with a derived datatypes).
The types of data already defined for you are called "basic datatypes," and are listed below. WARNING: note that the names are slightly different between the C implementation and the Fortran implementation.
MPI Basic Datatypes for C
---------------------------------------
MPI Datatype C datatype
---------------------------------------
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
MPI_PACKED
---------------------------------------
MPI Basic Datatypes for Fortran
---------------------------------------
MPI Datatype Fortran Datatype
---------------------------------------
MPI_INTEGER INTEGER
MPI_REAL REAL
MPI_DOUBLE_PRECISION DOUBLE PRECISION
MPI_COMPLEX COMPLEX
MPI_LOGICAL LOGICAL
MPI_CHARACTER CHARACTER(1)
MPI_BYTE
MPI_PACKED
---------------------------------------
As mentioned earlier, a message consists of the actual data and the message envelope. The envelope provides information on how to match sends to receives. The three parameters used to specify the message envelope are:
These arguments are set to a rank in a communicator (see below). Ranks range from 0 to (size-1) where size is the number of processes in the communicator.
Destination is specified by the send and is used to route the message to the appropriate process. Source is specified by the receive. Only messages coming from that source can be accepted by the receive call. The receive can set source to MPI_ANY_SOURCE to indicate that any source is acceptable.
An arbitrary number to help distinguish among messages. The tags specified by the sender and receiver must match. The receiver can specify MPI_ANY_TAG to indicate that any tag is acceptable.
The communicator specified by the send must equal that specified by the receive. We'll describe communicators in more depth later in the module. For now, we'll just say that a communicator defines a communication "universe", and that processes may belong to more than one communicator. In this module, we will only be working with the predefined communicator MPI_COMM_WORLD which includes all processes in the application.
To help understand the message envelope, let's consider the analogy of a bill collection agency that collects for several utilities. When sending a bill the agency must specify:
As promised earlier, we are now going to explain a little more about communicators. We will not go into great detail -- only enough that you will have some understanding of how they are used.
As described above, a message's eligibility to be picked up by a specific receive call depends on its source, tag, and communicator. Tag allows the program to distinguish between types of messages. Source simplifies programming. Instead of having a unique tag for each message, each process sending the same information can use the same tag. But why is a communicator needed?
An example
Suppose you are sending messages between your processes, but you are also calling a set of libraries you obtained elsewhere, which also runs on multiple nodes and communicates within itself using MPI. In this case, you want to make sure that messages you send go to your processes, and do not get confused with the messages being sent internally within the library routines.
In this example, we have three processes communicating with each other. Each process also calls a library routine, and the three library routines communicate with each other. We want to have two different message "spaces" here, one for our messages, and one for the library's messages. We do not want any intermingling of the messages.
The diagram below shows what we would like to happen. The white boxes represent our own routines ("caller"). The shaded boxes represent the library routines ("callee"). Sends and receives are indicated with their destination or source. In this case, everything works as intended.
However, there is no guarantee that things will occur in this order, given the fact that the relative scheduling of processes on different nodes can occur non-deterministically. Suppose, for example, that we change the third process by adding some computation at the beginning. The sequence of events might then occur as follows:
In this case, communications do not occur as intended. The first "receive" in process 0 now receives the "send" from the library routine in process 1, not the intended (and now delayed) "send" from process 2. As a result, all three processes hang.
This problem is solved by the library developer defining a new communicator, and specifying this in all send and receive calls made by the library. This would create a library ("callee") message space separate from the user's ("caller") message space. Tags can't be used to separate the message spaces because they are chosen by the programmer. How can the programmer avoid using the same tags as the library, when they don't know what tags the library uses? Communicators are returned by the system and incorporate a system-wide unique identifier.
In addition to development of parallel libraries, communicators are useful in organizing communication within an application.
Up to this point, we've looked at communicators that include all processes in the application. But the programmer can also define a subset of processes, called a process group, and attach one or more communicators to the process group. Communication specifying that communicator is now restricted to those processes.
This also ties directly into use of collective communications.
To revisit the bill collection analogy: one person may have an account with the electric and phone companies (2 communicators) but not with the water company. The electric communicator may contain different people than the phone communicator. A person's ID number (rank) may vary with the utility (communicator). So, it is critical to note that the rank given as message source or destination is the rank in the specified communicator.
5. Compiling and Running MPI jobs on the IBM SP
These scripts are: mpxlf - Fortran mpxlf options program.f mpcc - C mpcc options program.c mpCC - C++ mpCC options program.C
poe exec -rmpool 1 -procs 4 -euidevice css0 -euilib us
Although MPI provides for an extensive, and sometimes complex, set of calls, you can begin with
just the six basic calls:
However, for programming convenience and optimization of your code, you should consider using
other calls, such as those described in the more advanced references.
MPI Messages
MPI messages consist of two parts:
The data defines the information to be sent or received. The envelope is used in routing messages
to the receiver, and in matching send calls to receive calls.
Communicators
Communicators guarantee unique message spaces. In conjunction with process groups, they can
be used to limit communication to a subset of processes.
Good MPI references are:
Online Example Programs:
This program illustrates a simple domain decomposition (by blocks of rows or columns).
MPI_Send, MPI_Recv
6. Compiling and Running MPI jobs on the Origin
a>
f77 options program.f -lmpi ! for FORTRAN codes
cc options program.c -lmpi ! For C codes
mpirun -np 2 a.out
www.msi.umn.edu /origin/tutorials/intro/intro-content.html#sub_job2
8. Acknowledgements, References and Online Examples
This talk was based on the "Basics of MPI Programming" developed at
the Cornell Theory center and on the tutorial by Ewing Lusk.
by William Gropp, Ewing Lusk, and Anthony Skjellum
This program is a simple, self-scheduling version of matrix/vector multiply.
MPI_Bcast, MPI_Send, MPI_Recv, MPI_Abort, MPI_ANY_TAG
This code contains the complete communications skeleton for a dynamically load balanced master/slave application.
MPI_Send, MPI_Recv