In many message-passing libraries, such as PVM or MPL, the method by which the system handles messages has been chosen by the library developer. The chosen method gives acceptable reliability and performance for all possible communication scenarios. But it may hide possible programming problems or may not give the best performance in specialized circumstances. For MPI, this is equivalent to standard mode communication, which you have been introduced to in Basics of MPI Programming and MPI Point to Point Communication I.
In MPI, more control over how the system handles the message has been given to the programmer, who selects a communication mode when they select a send routine. In addition to standard mode, MPI provides synchronous, ready, and buffered modes. This module will look at the system behavior for each mode, and discuss their advantages and disadvantages.
The communication mode is selected with the send routine. There
are four blocking send routines and four non-blocking send routines,
corresponding to the four communication modes. The receive routine
does not specify communication mode -- it is simply blocking or
non-blocking.
We'll start by examining the behavior of blocking communication for
the four modes, beginning with synchronous mode. For compactness,
we'll delay examination of non-blocking behavior until Section 3.
In the diagram below, time increases from left to right. The heavy
horizontal line marked S represents execution time of the
sending task node, and the heavy dashed line marked R
represents execution time of the receiving task (on a second node).
Breaks in these lines represent interruptions due to the
message-passing event.
When the blocking synchronous send MPI_Ssend
(S)
is executed, the sending task sends the receiving task a "ready to
send" message. When the receiver executes the receive call, it
sends a "ready to receive" message. The data are then transferred.
There are two sources of overhead in message-passing.
System overhead is incurred from copying the message
data from the sender's message buffer onto the network, and from
copying the message data from the network into the receiver's message
buffer.
Synchronization overhead is the time spent waiting
for an event to occur on another task. The
sender must wait for the receive to be executed and for the handshake
to arrive before the message can be transferred. The receiver also
incurs some synchronization overhead in waiting for the handshake to
complete. Synchronization overhead can be significant, not
surprisingly, in synchronous mode. As we shall see, the other modes
try different strategies for reducing this overhead.
Only one relative timing for the MPI_Ssend
(S)
and MPI_Recv
(S)
calls is shown, but they can come in either order. If the receive
call precedes the send, most of the synchronization overhead will be
incurred by the receiver.
One might hope that, if workload is properly load balanced,
synchronization overhead would be minimal on both the sending and
receiving task. This is not realistic on the SP. If nothing else
causes lack of synchronization, UNIX daemons which run at
unpredictable times on the various nodes will cause unsynchronized
delays. One might respond to this by saying that it would be simpler
to just call MPI_Barrier frequently to keep the tasks in sync, but
that call itself incurs synchronization overhead and doesn't assure
that the tasks will be in sync a few seconds later. Thus, barrier
calls are almost always a waste of time.
The ready mode send MPI_Rsend
(S)
simply sends the message out over the network. It requires that the
"ready to receive" notification has arrived, indicating that the
receiving task has posted the receive. If the "ready to receive"
message hasn't arrived, the ready mode send will incur an error. By
default, the code will exit. The programmer can associate a different
error handler with a communicator to override this default behavior.
The diagram shows the latest posting of the MPI_Recv
(S)
that would not cause an error.
Ready mode aims to minimize system overhead and synchronization
overhead incurred by the sending task. In the blocking case, the only
wait on the sending node is until all data have been transferred out
of the sending task's message buffer. The receive can still incur
substantial synchronization overhead, depending on how much earlier it
is executed than the corresponding send.
This mode should not be used unless the user is certain that the
corresponding receive has been posted.
The blocking buffered send MPI_Bsend
(S)
copies the data from the message buffer to a user-supplied buffer, and
then returns. The sending task can then proceed with calculations
that modify the original message buffer, knowing that these
modifications will not be reflected in the data actually sent. The
data will be copied from the user-supplied buffer over the network
once the "ready to receive" notification has arrived.
Buffered mode incurs extra system overhead, because of the
additional copy from the message buffer to the user-supplied buffer.
Synchronization overhead is eliminated on the sending task -- the
timing of the receive is now irrelevant to the sender.
Synchronization overhead can still be incurred by the receiving task.
Whenever the receive is executed before the send, it must wait for the
message to arrive before it can return.
Another benefit for the user is the opportunity to provide the
amount of buffer space for outgoing messages that the program needs.
On the downside, the user is responsible for managing and attaching
this buffer space. A buffered mode send that requires more buffer
space than is available will generate an error, and (by default) the
program will exit.
For IBM's MPI implementation, buffered mode is actually a little
more complicated than depicted in the diagram, as a receive side
system buffer may also be involved. This system buffer is discussed
along with standard mode.
For a buffered mode send, the user must provide the buffer: it can
be a statically allocated array, or memory for the buffer can be
dynamically allocated with malloc. The amount of memory allocated for
the user-supplied buffer should exceed the sum of the message data, as
message headers must also be stored. The parameter
MPI_BSEND_OVERHEAD gives the
bytes needed for each message for pointers and envelope information.
This space must be identified as the user-supplied buffer by a call
to MPI_Buffer_attach
(S)
. When it is no longer needed, it should be detached with
MPI_Buffer_detach
(S)
. There can only be one user-supplied message buffer active at a
time. It will store multiple messages. The system keeps track of
when messages ultimately leave the buffer, and will reuse buffer
space. For a program to be safe, it should not depend on this
happening.
2.1 Blocking Behavior
2.1.1 Blocking Synchronous Send
2.1.2 Blocking Ready Send
2.1.3 Blocking Buffered Send
2.1.4 Buffer Management
2.1.5 Blocking Standard Send
For standard mode, the library implementor specifies the system behavior that will work best for most users on the target system. For IBM's MPI, there are two scenarios, depending on whether the message size is greater or smaller than a threshold value (called the eager limit). The eager limit depends on the number of tasks in the application:
| Number of Tasks | Eager Limit (bytes) = threshold |
|---|---|
| 1 - 16 | 4096 |
| 17 - 32 | 2048 |
| 33 - 64 | 1024 |
| 65 - 128 | 512 |
The behavior when the message size is less than or equal to the threshold is shown below:
In this case, the blocking standard send MPI_Send (S) copies the message over the network into a system buffer on the receiving node. The standard send then returns, and the sending task can continue computation. The system buffer is attached when the program is started -- the user does not need to manage it in any way. There is one system buffer per task that will hold multiple messages. The message will be copied from the system buffer to the receiving task when the receive call is executed.
As with buffered mode, use of a buffer decreases the likelihood of synchronization overhead on the sending task at the price of increased system overhead resulting from the extra copy to the buffer. As always, synchronization overhead can be incurred by the receiving task if a receive is posted first.
Unlike buffered mode, the sending task will not incur an error if the buffer space is exceeded. Instead the sending task will block until the receiving task calls a receive that pulls data out of the system buffer. Thus, synchronization overhead can still be incurred by the sending task.
When the message size is greater than the threshold, the behavior of the blocking standard send MPI_Send (S) is essentially the same as for synchronous mode.
Why does standard mode behavior differ with message size? Small messages benefit from the decreased chance of synchronization overhead resulting from use of the system buffer. However, as message size increases, the cost of copying to the buffer increases, and it ultimately becomes impossible to provide enough system buffer space. Thus, standard mode tries to provide the best compromise.
You have now seen the system behavior for all four communication modes.
The concept of deadlock was introduced in MPI Point to Point Communication I. It is easier to understand the events that lead to deadlock once you understand the behavior of the various communication modes.
The following diagram represents two SPMD tasks: both are calling blocking standard sends at the same point of the program. Each task's matching receive occurs later in the other task's program.
If the message size is greater than the threshold, deadlock will occur because neither task can synchronize with its matching receive. If the message size is less than or equal to the threshold, deadlock can still occur if insufficient system buffer space is available. Both tasks will be waiting for a receive to draw message data out of the system buffer, but these receives cannot be executed because both tasks are blocked at the send.
Synchronous mode is the "safest", and therefore also the most portable. "Safe" means that if a code runs under one set of conditions (i.e. message sizes) it will run under all conditions. Synchronous mode is safe because it does not depend upon the order in which the send and receive are executed (unlike ready mode) or the amount of buffer space (unlike buffered mode and standard mode). Synchronous mode can incur substantial synchronization overhead.
Ready mode has the lowest total overhead. It does not require a handshake between sender and receiver (like synchronous mode) or an extra copy to a buffer (like buffered or standard mode). However, the receive must precede the send. This mode will not be appropriate for all messages.
Buffered mode decouples the sender from the receiver. This eliminates synchronization overhead on the sending task and ensures that the order of execution of the send and receive does not matter (unlike ready mode). An additional advantage is that the programmer can control the size of messages to be buffered, and the total amount of buffer space. There is additional system overhead incurred by the copy to the buffer.
Standard mode behavior is implementation-specific. The library developer choses a system behavior that provides good performance and reasonable safety. For IBM's MPI, small messages are buffered (to avoid synchronization overhead) and large messages are sent synchronously (to minimize system overhead and required buffer space).
Blocking and non-blocking routines were introduced in MPI Point to Point Communication I
.A blocking send or receive call suspends execution of the program until the message buffer being sent/received is safe to use. Non-blocking calls return immediately after initiating the communication. The programmer does not know at this point whether data to be sent have been copied out of the send buffer, or whether data to be received have arrived. So, before using the message buffer, the programmer must check its status.
We have seen the blocking behavior for each of the communication modes. We will now discuss the non-blocking behavior for standard mode. The behaviors of the other modes can be implied from this.
The non-blocking standard send MPI_Isend (S) and non-blocking receive MPI_Irecv (S) . As before, the standard mode send will proceed differently depending on the message size.
The sending task posts the non-blocking standard send when the message buffer contents are ready to be transmitted. It returns immediately without waiting for the copy to the remote system buffer to complete. MPI_Wait (S) is called just before the sending task needs to overwrite the message buffer.
The receiving task calls a non-blocking receive as soon as a message buffer is available to hold the message. The non-blocking receive returns without waiting for the message to arrive. The receiving task calls MPI_Wait (S) when it needs to use the incoming message data (i.e. needs to be certain that it has arrived).
The system overhead will not differ substantially from the blocking send and receive calls unless data transfer and computation can occur simultaneously. Since the SP node CPU must perform both the data transfer and the computation, computation will be interrupted on both the sending and receiving nodes to pass the message. When the interruption occurs should not be of any particular consequence to the program that is running. Even for architectures that overlap computation and communication, the fact that this case applies only to small messages means that no great difference in performance would be expected.
The advantage of using the non-blocking send occurs when the system buffer is full. In this case, a blocking send would have to wait until the receiving task pulled some message data out of the buffer. If a non-blocking call is used, computation can be done during this interval.
The advantage of a non-blocking receive over a blocking one can be considerable if the receive is posted before the send. The task can continue computing until the Wait is posted, rather than sitting idle. This reduces the amount of synchronization overhead.
Non-blocking calls can ensure that deadlock will not result. The
Wait must be posted after the calls needed to complete the
communication.
3.2 Behavior of Non-blocking Calls, cont'd
The case of a non-blocking standard send MPI_Isend (S) for a message larger than the threshold is more interesting:
For a blocking send, the synchronization overhead would be the period between the blocking call and the copy over the network. For a non-blocking call, the synchronization overhead is reduced by the amount of time between the non-blocking call and the MPI_Wait (S) , in which useful computation is proceeding.
Again, the non-blocking receive MPI_Irecv (S) will reduce synchronization overhead on the receiving task for the case in which the receive is posted first. There is also a benefit to using a non-blocking receive when the send is posted first. Typically, blocking receives are posted immediately before the message data must be used (to allow the maximum amount of time for the communication to complete). So, the blocking receive would be posted in place of the MPI_Wait. This would delay the synchronization with the send call until this later point in the program, and thus increase synchronization overhead on the sending task.
On some machines, non-blocking calls can reduce the system overhead. This requires that the message transport can be handled in the background without having any impact on the computations in progress. This is not currently true for the SP.
The greater advantage of non-blocking communication is in reducing synchronization overhead and eliminating deadlock. Synchronization overhead for the sender can be reduced for synchronous mode and standard mode (ready and buffered mode don't accumulate send-side synchronization overhead). Non-blocking communication can reduce synchronization overhead for the receiver regardless of the communication mode.
It is best to post non-blocking sends and receives as early as possible, and call Wait as late as possible. "Early as possible" means that the data in the buffer to be sent must be valid, and likewise the buffer to be received into must not be in use. Keep in mind that tasks may still need to synchronize on the Wait, if the message has not yet been transferred. You want to maximize the time the task is computing in the hope that, by the time the task reaches the Wait, the communication will have completed.
4. Programming Recommendations
In general, it is reasonable to start programming with non-blocking calls and standard mode. Non-blocking calls can eliminate the possibility of deadlock and reduce synchronization overhead. Standard mode gives generally good performance.
Blocking calls may be required if the programmer wishes the tasks to synchronize. Also, if the program requires a non-blocking call to be immediately followed by a Wait, it is more efficient to use a blocking call. If using blocking calls, it may be advantageous to start in synchronous mode, and then switch to standard mode. Testing in synchronous mode will ensure that the program does not depend on the presence of sufficient system buffering space.
The next step is to analyze the code and evaluate its performance. If non-blocking receives are posted early, well in advance of the corresponding sends, it might be advantageous to use ready mode. In this case, the receiving task should probably notify the sender after the receives have been posted. After receiving the notification, the sender can proceed with the sends.
If there is too much synchronization overhead on the sending task, especially for large messages, buffered mode may be more efficient. For IBM's MPI, an alternative would be to use the run-time flag to change the threshold message size at which system behavior switches from buffered to synchronous.
This tutorial was strongly based on copyrighted material developed at the Cornell Theory Center and is used here with their permission.