In point to point communication, one process sends a message and a second process receives it. This is in contrast to collective communication routines, in which a pattern of communication is established amongst a group of processes. There are several programming options for point to point communication that relate to how the system handles the message, and this tutorial will give you a brief, high-level overview of some of them which are available in MPI.
There are two general aspects to point to point comunication. One refers to the communication mode and the other corresponds to how program control is returned to the user when the send or receive is posted.
Communication modes specify how the system handles the message. For instance, if a send has completed, does that tell us anything about the receive? Can we know if the receive has finished or even begun? Modes specify the underlying protocol for sending or receiving messages, and will be discussed in more detail in another tutorial.
The other aspect to point to point communication focuses on the issue of program control, and is the primary topic which we will discuss here. Program control with point to point communication is enforced by using blocking or nonblocking send/receives. A blocking send "blocks" until the send buffer - the contents of the message in the send call - can be reclaimed. Similarily a blocking receive "blocks" until the receive buffer contains the contents of the message that was sent.
The send call is potentially non-local. It may copy the contents of its buffer directly to the receive buffer or to a temporary system buffer. If it does the former, the send is not completed until the receive is initiated. If it does the latter, the send returns ahead of the matching receive. The choice of which send is implemented is implementation dependent.
Blocking calls contrast with nonblocking class which allow the overlap of message transmittal and computation. Nonblocking calls return control immediately to the user, while a blocking send/receive is one in which control is not immediately returned. The key ideas include:
A blocking or non-blocking send can be paired to a blocking or non-blocking receive.
In both sending and receiving modes, the buffer used to contain the message can be an oft-used resource, and problems arise when it is used before an on-going transaction has completed; blocking communications insure that this never happens -- when control returns from the blocking call, the buffer can safely be modified without any danger of corrupting some other part of the process.
A non-blocking call effectively guarantees that an interrupt will be generated when the transaction is ready to proceed, thus allowing the original thread to get back to computationally-oriented processing.
Deadlock is an often-encountered situation in parallel processing, resulting when two or more processes are in contention for the same set of resources.
In communications, a typical scenario involves two processes wishing to exchange messages, but both trying to give theirs to the other while neither is yet ready to accept one.
A number of strategies will be described to help insure against this sort of thing occurring in the application.
Applications can use specific calls to control how they deal with incomplete transactions, or transactions whose state is not known, without necessarily having to complete the operation (this is analogous to wanting to know whether or not your aunt has written you, but not particularly caring what she had to say). The application can be programmed to be aware of the state of its communications, and thereby act intelligently in different situations:
When involved in non-blocking transactions, a number of calls make it possible for the application to query the status of particular messages, or to check on any that meet a certain set of characteristics, without necessarily taking any completed transactions out of the queue.
Checking the information returned from a transaction allows the application to, for example, take corrective action if an error has occurred.
A generic term meaning "anything meeting a very general set of characteristics." MPI_ANY_SOURCE allows the receiver to get messages from any sender, and MPI_ANY_TAG allows the receiver to get any kind of message from a sender.
Applications often deal with regular data structures, and perform the same kind of communication everywhere within them, except for at the edges, where special code has to be written in order not to communicate where there are no valid "neighbors" to receive, or from whom to receive; special null parameters move the logic for this out of user-code and into the system, simplifying the application.
The biggest advantage to non-blocking sends and receives is that computation may proceed concurrently with the communication. That is computation may be done after the communication was posted, and before it was completed, provided the buffer is not used in the computation.
A blocking send or receive call suspends execution of the program until the message buffer being sent/received is safe to use. In the case of a blocking send, this means that the data to be sent have been copied out of the send buffer, but they have not necessarily been received in the receiving task. The contents of the send buffer can be modified without affecting the message that was sent. Completion of a blocking receive implies that the data in the receive buffer are valid.
Non-blocking calls return immediately after initiating the communication. The programmer does not know at this point whether the data to be sent have been copied out of the send buffer, or whether the data to be received have arrived. So, before using the message buffer, the programmer must check its status.
The programmer can choose to block until the message buffer is safe to use (MPI_WAIT and variants (S) or to just return the current status of the communication (MPI_TEST and variants (S)).
The different variants of the Wait and Test calls allow you to check the status of a specific message, or to check all, any, or some of a list of messages.
It is fairly intuitive why you need to check the status of a non-blocking receive: you do not want to read the message buffer until you are sure that the message has arrived. It is less obvious why you would need to check the status of a non-blocking send. This is most necessary when you have a loop that repeatedly fills the message buffer and sends the message. You can't write anything new into that buffer until you know for sure that the preceding message has been successfully copied out of the buffer. Even if a send buffer is not re-used, it is advantageous to complete the communication, as this releases system resources (the request object, described on the next page).
2.1 Syntax of Non-blocking Calls
The nonblocking calls have the same syntax
(S)
as the blocking ones, with two exceptions:
For example, the standard non-blocking send and a corresponding
non-blocking receive call look like this:
MPI_Isend
(buf,count,dtype,dest,tag,comm,request)
MPI_Irecv
(buf,count,dtype,dest,tag,comm,request)
MPI_Isend
(buf,count,dtype,dest,tag,comm,request,ierror)
MPI_Irecv
(buf,count,dtype,source,tag,comm,request,ierror)
Note that the Wait and Test calls take one or more request handles as
input and return one or more status arrays (or structure in C).
The request handle lets the user identify communication operations
and links the posting operation (eg. MPI_ISEND) with the completion
operation (eg. MPI_WAIT). The status array (structure in C) has
information about the success or failure of the communication and
other miscellaneous information.
Wait, Test, and status are discussed in detail later in this talk.
2.2 Basic Deadlock
Deadlock is a phenomenon most common with blocking
communication. It occurs when all tasks are waiting for events
that haven't been initiated yet. The following diagram
represents two SPMD tasks: both are calling blocking standard
sends at the same point of the program. Each task's matching
receive occurs later in the other task's program.
There are four ways to avoid deadlock:
Arrange for one task to post its receive first and for the
other to post its send first. That clearly establishes
that the message in one direction will precede the other.
Have each task post a non-blocking receive before it does
any other communication. This allows each message to be
received, no matter what the task is working on when the
message arrives or in what order the sends are posted.
MPI_Sendrecv_replace
Use MPI_Sendrecv
(S)
The send-receive combines in one call the sending
of a message to a destination and the receiving
from a source, and is particularly useful when a
node both sends and receives a message. It is an
elegant solution to the deadlock problem because
the "send half" of the send-receive on node i
can complete before the "receive half" on node j
is initiated. In the _replace
(S)
version, the same buffer is used for both the send
and receive, so the message that was sent is replaced
by the received message. The system implements the
send-receive-replace by allocating some buffer space (not
subject to the threshold limit) to deal with the exchange
of messages.
Use buffered sends so that computation can proceed after
copying the message to the user-supplied buffer. This
will allow the send to complete and the subsequent receive
to be executed.
Buffered sends are discussed in the follow-on to this
module, More
Point to Point Communication with MPI.
Non-blocking calls have the advantage that computation can
continue almost immediately, even if the message can't be
sent. This can eliminate deadlock and reduce
synchronization overhead.
On some machines, the system overhead can be reduced if
the message transport can be handled in the background
without having any impact on the computations in progress
on both the sender and receiver. This is not currently
true for the SP.
A knowledgeable source
has the following comments regarding whether or not
non-blocking calls do in fact result in a reduction of
system overhead:
Some additional programming is required with non-blocking calls,
to test for completion of the communication. It is best to
post sends and receives as early as possible, and to wait for
completion as late as possible. "Early as possible" above means
that the data in the buffer to be sent must be valid, and
likewise the buffer to be received into must exist.
Do not write to send buffer between
MPI_Isend
and MPI_WAIT
and must avoid reading and writing in receive
buffer between MPI_Irecv
and MPI_WAIT
It should be possible to safely read the send buffer after the
send is posted, but nothing should be written to that buffer
until after status has been checked to give assurance that the
original message has been sent. Otherwise, the original message
contents could be overwritten. NO user reading or writing of the
receive buffer should take place between posting a non-blocking
receive and determining that the message has been received.
The read might give either old data or new (incoming) message
data. A write could overwrite the recently arrived message.
3.1 Wait, Test, and Probe
Although it is often useful to decouple non-blocking communications
from the computational thread, it is also useful to recouple them.
By doing so, one can
obtain information about the status of the communication transaction,
and then take actions depending on this status.
The three calls described here are among the most commonly
useful for dealing with these kinds of situations.
Blocking communications involve an automatic wait,
so you'll never see a non-trivial call to it when such
operations are used. In both the send and receive
for non-blocking operations, the calling process suspends
its operation until the operation referenced by the
wait has completed, at
which time execution resumes in the calling process.
On the receive side, the process has already posted a
non-blocking receive, which will be completed
regardless of what the calling process does. The
programmer is therefore able to determine whether or not
any useful computation can be accomplished
before the information in the not-yet-received message is
required. If there is useful work available, the
application has clearly gained efficiency by doing that
work while the message is still in transit. At some
point, of course, the message will be needed, and a
wait will be issued.
On the sending side, the process is also freed from the
transaction, except for the fact that it is constrained
from doing any more communication with that particular
buffer until its current use is completed. If the
sender attempts to put some other message into that buffer
before the preceding transmission completes, the results
are indeterminate and based solely on what part of the
process was in train when the overwriting occurred. Doing
a wait on the message guarantees that re-use will
not be destructive.
While MPI_WAIT suspends execution until an operation
completes, MPI_TEST returns immediately with
information about its current status.
MPI_TEST will return true only in the
case that the sender specified in the object has sent
a message which is currently in the queue for delivery;
traffic from all other sources is ignored.
On the sender side, test is the non-blocking analog
to wait, giving the application knowledge of the
current state without requiring it to block until
completion, thus allowing the application to do other work,
if any exists.
In Fortran these are
returned as an array of integers. status(MPI_SOURCE), status(MPI_TAG), and status(MPI_ERROR)
give the source,tag, and error condition of the incoming message, respectively.
In C they are returned as a structure of type MPI_STATUS. The fields
status.MPI_SOURCE, status.MPI_TAG, and
status.MPI_ERROR.
The status array (Fortran) or structure (C) must be declared in the user's
program.
IBM's MPI implementation varies from the MPI standard
regarding the status information by making available the
total number of bytes involved in the operation. The standard
mechanism for obtaining this information is to call MPI_Get_count
(see below). Our trusty unnamed
knowledgeable source at CTC comments:
This can be useful if you've allocated a receive buffer
that may be larger than the incoming message, or if you want
to learn the length of a message identified with MPI_Probe
(S).
The information made available by status comes in very
handy in the following situations:
Wildcard is a concept lifted from certain card games relating to
an entity's ability to take on any legal value; in this message
passing context, that ability comes in most useful when dealing with
how a receiver determines which messages will be accepted for
delivery:
Many times it is useful to specify "dummy" source or destination
values for communication.
MPI_PROC_NULL can be used instead of rank whenever source
or destination is used in a communication call.
A communication with MPI_PROC_NULL has no effect. A send returns
as soon as possible. A receive returns with no change to the
input buffer. The source is set to MPI_PROC_NULL, the
tag is set to MPI_ANY_TAG, and count equals 0.
MPI_PROC_NULL is most useful for taking boundary-tests out of user
code. For instance,
if the action desired in such situations is simply to not
send the message, then the programmer can have this
accomplished automatically by specifying one of these null
values for such circumstances, and, when the send or receive
call determines that a null parameter has been given to it, the
operation immediately returns as if it had completed
successfully. This has the same effect as saying "In general,
pass this to your neighbor", while tacitly knowing that you
really mean "...as long as you have a neighbor",
but such is a well-understood case, and you don't feel the need
to explicitly mention it.
This will allow computation and communication
to occur concurrently. This is important because
sending data between processors is much slower
than manipulating data on a processor, so to prevent
a program being "starved for data", using non-blocking calls
allow communication to be initiated while other operations
are performed.
The possibility of
deadlock is eliminated and synchronization overhead
is reduced.
If your application requires that cooperating processes
must be kept in a more-or-less strict lockstep, then
non-blocking calls become less useful and more
cumbersome. Besides, synchronization is what the
blocking calls are intended to provide.
The action of these two paired operations is exactly
that of a blocking call, so why not simply use
it?
If you choose to use blocking transactions, try to
guarantee that deadlock will be avoided by carefully tailoring
your communication strategy so that sends and receives are
properly paired and in the necessary order; alternatively,
post non-blocking receives as early as possible, so that the
sends will stay in the system for as little time as is
necessary.
Correctly knowing the state of communications transactions
allows the application to intelligently steer itself to better
efficiency in its use of available cycles. Ultimately
pending traffic must be accepted, but that action can be long
in coming and much can possibly be accomplished while it is
incomplete. The wait, test and probe calls
allow the application to match the appropriate activity with
the particular situation.
Don't just assume that things are running smoothly -- make it
a general rule that every transaction is checked for success,
and that failures are promptly reported and as much related
information as possible is developed and made available for
debugging.
Using general receives, receives capable of handling more than
one kind of message traffic (in terms of either sender, or
message-type, or both), can greatly simplify the structure of
your application, and can potentially save on system resources
(if you are in the habit of using a unique message buffer for
each transaction).
This tutorial was strongly based on copyrighted
material from the Cornell Theory Center and was used
with their permission.
Additional tutorials on point-to-point communications can be found
by clicking here
Franke, H. (1995) MPI-F, An MPI Implementation for
IBM SP-1/SP-2. Version 1.41 5/2/95. Available in
postscript, under
http://www.tc.cornell.edu/UserDoc/Software/PTools/mpi/
Franke, H., Wu, C.E., Riviere, M., Pattnaik, P. and Snir,
M. (1995) MPI Programming Environment for IBM
SP1/SP2. Proceedings of ICDCS 95, Vancouver, 1995.
Available in postscript, under
http://www.tc.cornell.edu/UserDoc/Software/PTools/mpi/
Gropp, W., Lusk, E. and Skjellum, A. (1994) Using
MPI. Portable Parallel Programming with the Message-Passing
Interface. The MIT Press. Cambridge, Massachusetts.
MPI: A Message-Passing Interface Standard June 1995
Message Passing Interface Forum (1995) MPI: A
Message Passing Interface Standard. June 12, 1995.
Available in postscript from
http://www.epm.ornl.gov/~walker/mpi/
RS/6000 SP: Practical MPI Progr
amming by Yukiya Aoyama and Jun Nakano.
and an example of a Wait, (MPI_WAIT), call will look like this:
2.3 Summary: Non-blocking Calls
...I have the strong suspicion that
non-blocking calls can actually incur more overhead
than blocking ones that result in immediate message
transfer. This extra overhead comes from several sources:
(a) the cost of allocating a request object; (b) the
overhead of doing the interrupt when the data are
transferred later; and (c) the cost of querying to
determine whether the transfer has completed. Cost (a)
can be eliminated by using persistent requests (definitely
an advanced topic). All these can be small compared to
the synchronization overhead that is avoided if useful
computations are available to be done.
Once you've gotten some experience with the basics of MPI,
consult the standard or available texts for information on
persistent requests; the rest of the comment simply
points out that a blocking call carries with it
much less systems-baggage than does a non-blocking
call, if you can assume that both are going to be
satisfied immediately, but if the transaction is
not immediately satisfied, then the non-blocking
call wins to the extent that "useful computation" can be
accomplished prior to its conclusion.
Although the standard doesn't specify where the
number of bytes in the message appear in the status object, it
does specify that the status must contain enough information
so the MPI_Get_count can do its thing. Thus, it effectively
mandates that the number of bytes will be there, just not
where. The IBM MPI does the obvious -- it puts the message
size in bytes in the first optional field.
4.2 Null Parameters