Collective communication involves sending and receiving data among all the processes in a "communicator universe" (communicators were discussed in the Introduction to MPI tutorial). MPI's collective communication routines are built on point-to-point communication routines and tuned by the vendors to handle the communication complexity efficiently. You can certainly build your own collective communication routines, but that could involve a lot of extra work. There is no sense in re-inventing the wheel.
Although other message-passing libraries provide some collective communication calls, none of them provides a set of collective communication routines as complete and robust as those provided by MPI. On this page, we introduce the basic characteristics of collective communication, the operation routines, and new information from vendors on their collective communication routines.
2.1 Characteristics
MPI collective communication routines differ in many ways from MPI point-to-point communication routines, which were introduced in MPI Point-to-Point Communication I. Here are the characteristics of MPI collective communication routines:
2.2 Barrier Synchronization Routines
In parallel applications in a distributed memory environment, explicit or implicit synchronization is sometimes required. As with other message-passing libraries, MPI provides a function call, MPI_BARRIER, to synchronize all processes within a communicator. A barrier is simply a synchronization primitive: a process calling it is blocked until all the processes within the group have called it. The syntax of MPI_BARRIER for both C and FORTRAN programs is given below:
C:
int MPI_Barrier(MPI_Comm comm)
where:
| Argument | Description |
|---|---|
| MPI_Comm | an MPI predefined structure for communicators |
| comm | a communicator |
FORTRAN:
MPI_BARRIER(comm, ierr)
where:
| Argument | Description |
|---|---|
| comm | an integer denoting a communicator |
| ierr | an integer return error code |
Now, let's take a look at the functionality and syntax of these routines.
C:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
FORTRAN:
MPI_BCAST(buffer, count, datatype, root, comm, ierr)
where:
| Argument | Description |
|---|---|
| buffer | the starting address of a buffer |
| count | an integer indicating the number of data elements in the buffer |
| datatype | an MPI defined constant indicating the data type of the elements in the buffer |
| root | an integer indicating the rank of the broadcast root process |
| comm | the communicator |
MPI_BCAST must be called by every process in the group, each specifying the same comm and root. The message is sent from the root process to all processes in the group, including the root process itself.
C:
int MPI_Gather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
int MPI_Scatter(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
FORTRAN:
MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)
MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)
where, for the Gather routines:
| Argument | Description |
|---|---|
| sbuf | the starting address of the send buffer |
| scount | the number of elements in the send buffer |
| stype | the data type of the send buffer elements |
| rbuf | the starting address of the receive buffer |
| rcount | the number of elements for any single receive |
| rtype | the data type of the receive buffer elements |
| root | the rank of the receiving process |
| comm | the communicator |
and for the Scatter routines:
| Argument | Description |
|---|---|
| sbuf | the address of the send buffer |
| scount | the number of elements sent to each process |
| stype | the data type of the send buffer elements |
| rbuf | the address of the receive buffer |
| rcount | the number of elements in the receive buffer |
| rtype | the data type of the receive buffer elements |
| root | the rank of the sending process |
| comm | the communicator |
In the gather operation, each process (root process included) sends scount elements of type stype from sbuf to the root process. The root process receives the messages and stores them in rank order in rbuf. For scatter, the reverse holds. The root process sends a buffer of N chunks of data (N = number of processes in the group), so that the process with rank 0 gets the first chunk, the process with rank 1 gets the second chunk, and so on.
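The rank-order rule described above can be made concrete with a small model. The following is a minimal sketch in plain C that mimics only the data movement inside a single program; it does not call MPI, and the names model_gather and model_scatter, as well as the use of int elements, are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Plain-C model of the data movement only; these are NOT MPI calls.
 * sbufs[r] and rbufs[r] stand in for the buffers owned by rank r. */

/* Gather: the root's rbuf receives count elements from every rank,
 * stored in rank order (rank 0 first). rbuf must hold nprocs*count ints. */
void model_gather(int nprocs, int count, int *sbufs[], int *rbuf)
{
    assert(nprocs > 0 && count >= 0);
    for (int r = 0; r < nprocs; ++r)
        memcpy(rbuf + (size_t)r * count, sbufs[r],
               (size_t)count * sizeof(int));
}

/* Scatter: the root's sbuf holds nprocs chunks of count elements;
 * chunk r is delivered to rank r. */
void model_scatter(int nprocs, int count, const int *sbuf, int *rbufs[])
{
    assert(nprocs > 0 && count >= 0);
    for (int r = 0; r < nprocs; ++r)
        memcpy(rbufs[r], sbuf + (size_t)r * count,
               (size_t)count * sizeof(int));
}
```

With three ranks holding {1,2}, {3,4} and {5,6}, the modeled gather leaves 1 2 3 4 5 6 in the root's buffer, and scattering that buffer with count 2 hands each rank its original pair back.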
In order to illustrate these functions, we give a graph below:
This picture is augmented by the following example for gather.
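      INCLUDE 'mpif.h'
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root, ierr
      DATA root/0/
C     Each process multiplies its own 25 rows of A by the vector b
C     (4 processes assumed, since 4 x 25 = 100)
      DO I=1,25
         cpart(I) = 0.
         DO K=1,100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
C     The root collects the 25-element pieces, in rank order, in ctotal
      call MPI_GATHER(cpart,25,MPI_REAL,ctotal,25,MPI_REAL,
     &                root, MPI_COMM_WORLD, ierr)
Here b is the vector shared by all processors, and c is the vector whose pieces (cpart) are updated by each processor independently.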
MPI_ALLGATHER can be thought of as MPI_GATHER where all processes, not just the root, receive the result. The jth block of the receive buffer is the block of data sent from the jth process. A similar relationship holds for MPI_ALLGATHERV and MPI_GATHERV. The syntax of MPI_ALLGATHER and MPI_ALLGATHERV is similar to that of MPI_GATHER and MPI_GATHERV, respectively, except that the argument root is dropped.
C:
int MPI_Allgather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
int MPI_Allgatherv(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, MPI_Comm comm)
FORTRAN:
MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
MPI_ALLGATHERV(sbuf, scount, stype, rbuf, rcount, displs, rtype, comm, ierr)
The variables for Allgather are:
| Argument | Description |
|---|---|
| sbuf | the starting address of the send buffer |
| scount | the number of elements in the send buffer |
| stype | the data type of the send buffer elements |
| rbuf | the address of the receive buffer |
| rcount | the number of elements received from any process |
| rtype | the data type of the receive buffer elements |
| comm | the group communicator |
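      INCLUDE 'mpif.h'
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER ierr
C     Each process multiplies its own 25 rows of A by the vector b
      DO I=1,25
         cpart(I) = 0.
         DO K=1,100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
C     Every process receives all the pieces, in rank order, in ctotal
      call MPI_ALLGATHER(cpart,25,MPI_REAL,ctotal,25,
     &                   MPI_REAL, MPI_COMM_WORLD, ierr)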
The syntax of MPI_ALLTOALL is:
C:
int MPI_Alltoall(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)
FORTRAN:
MPI_ALLTOALL(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
The variables for Alltoall are:
| Argument | Description |
|---|---|
| sbuf | the starting address of the send buffer |
| scount | the number of elements sent to each process |
| stype | the data type of the send buffer elements |
| rbuf | the address of the receive buffer |
| rcount | the number of elements received from any process |
| rtype | the data type of the receive buffer elements |
| comm | the group communicator |
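Note: The arguments have the same specification as for MPI_ALLGATHER, except that sbuf must contain scount*NPROC elements, where NPROC is the number of processes in the communicator. The jth block of scount elements is sent to process j.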
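The block exchange can be pictured as a transpose: block j of process i ends up as block i of process j. The following is a minimal sketch in plain C of that movement only; it makes no MPI calls, and the name model_alltoall and the int element type are assumptions for illustration.

```c
#include <assert.h>

/* Plain-C model of the MPI_Alltoall data movement; NOT an MPI call.
 * sbufs[i] and rbufs[i] each hold nprocs*count ints owned by rank i. */
void model_alltoall(int nprocs, int count, int *sbufs[], int *rbufs[])
{
    assert(nprocs > 0 && count >= 0);
    for (int i = 0; i < nprocs; ++i)        /* sending rank   */
        for (int j = 0; j < nprocs; ++j)    /* receiving rank */
            for (int k = 0; k < count; ++k)
                /* block j of sender i becomes block i of receiver j */
                rbufs[j][i * count + k] = sbufs[i][j * count + k];
}
```

For two ranks with count 1 and send buffers {10,11} and {20,21}, rank 0 ends up with {10,20} and rank 1 with {11,21}.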
C:
int MPI_Gatherv(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, int root, MPI_Comm comm)
int MPI_Scatterv(void* sbuf, int* scount, int* displs, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
FORTRAN:
MPI_GATHERV(sbuf, scount, stype, rbuf, rcount, displs, rtype, root, comm, ierr)
MPI_SCATTERV(sbuf, scount, displs, stype, rbuf, rcount, rtype, root, comm, ierr)
The variables for Gatherv are:
| Argument | Description |
|---|---|
| sbuf | the starting address of the send buffer |
| scount | the number of elements in the send buffer |
| stype | the data type of the send buffer elements |
| rbuf | the starting address of the receive buffer |
| rcount | the array containing the number of elements to be received from each process |
| displs | the array specifying the displacement relative to rbuf at which to place the incoming data from the corresponding process |
| rtype | the data type of the receive buffer elements |
| root | the rank of the receiving process |
| comm | the group communicator |
The variables for Scatterv are:
| Argument | Description |
|---|---|
| sbuf | the address of the send buffer |
| scount | the integer array specifying the number of elements to send to each process |
| displs | the array specifying the displacement relative to sbuf from which to take the data going out to the corresponding process |
| stype | the data type of the send buffer elements |
| rbuf | the address of the receive buffer |
| rcount | the number of elements in the receive buffer |
| rtype | the data type of the receive buffer elements |
| root | the rank of the sending process |
| comm | the group communicator |
For the purpose of illustrating the usage of MPI_GATHERV and MPI_SCATTERV, we give two Fortran program fragments below:
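As a complement, the count and displacement arrays that MPI_GATHERV and MPI_SCATTERV take are often built from a simple block partition. The following is a minimal sketch in plain C, assuming n elements are split as evenly as possible over nprocs processes with the remainder going to the lowest ranks; the function name block_partition and this particular distribution are illustrative choices, not something MPI prescribes.

```c
#include <assert.h>

/* Fill counts[r] and displs[r] for a contiguous block partition of
 * n elements over nprocs ranks. The first (n % nprocs) ranks get one
 * extra element. Displacements are in elements, as MPI_Gatherv and
 * MPI_Scatterv expect. */
void block_partition(int n, int nprocs, int *counts, int *displs)
{
    assert(n >= 0 && nprocs > 0);
    int base = n / nprocs;     /* minimum share per rank          */
    int extra = n % nprocs;    /* ranks that receive one more     */
    int offset = 0;
    for (int r = 0; r < nprocs; ++r) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

For n = 10 and nprocs = 4 this yields counts 3, 3, 2, 2 and displacements 0, 3, 6, 8. The resulting arrays would be passed as scount and displs to MPI_SCATTERV, or as rcount and displs to MPI_GATHERV, on the root.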
The MPI_ALLGATHERV routine collects individual messages from each task covered by the same communicator and distributes the resulting message to all tasks. Messages can have different sizes and displacements.
C:
int MPI_Allgatherv(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, MPI_Comm comm)
FORTRAN:
MPI_ALLGATHERV(sbuf, scount, stype, rbuf, rcount, displs, rtype, comm, ierr)
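The parameters for Allgatherv are:
| Argument | Description |
|---|---|
| sbuf | the starting address of the send buffer |
| scount | the number of elements in the send buffer |
| stype | the data type of the send buffer elements |
| rbuf | the address of the receive buffer |
| rcount | the integer array containing the number of elements received from each process |
| displs | the integer array specifying the displacement relative to rbuf at which to place the incoming data from the corresponding process |
| rtype | the data type of the receive buffer elements |
| comm | the group communicator |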
MPI_ALLTOALLV adds flexibility to MPI_ALLTOALL in that the location of data for the send is specified by sdispls and the location of the placement of the data on the receive side is specified by rdispls. The jth block sent from process i is received by process j and is placed in the ith block of recvbuf. These blocks need not all have the same size.
The type signature associated with sendcounts[j], sendtype at process i must be equal to the type signature associated with recvcounts[i], recvtype at process j. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. Distinct type maps between sender and receiver are still allowed.
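The pairwise matching rule can be checked mechanically when every process uses the same basic datatype, so that equal type signatures reduce to equal counts. The following is a minimal sketch in plain C under that assumption; it makes no MPI calls, and the function name and the flattened row-per-rank layout of the count arrays are illustrative.

```c
#include <assert.h>

/* sendcounts[i*nprocs + j]: elements rank i sends to rank j.
 * recvcounts[j*nprocs + i]: elements rank j expects from rank i.
 * Returns 1 if every sender/receiver pair agrees, 0 otherwise.
 * Assumes one common basic datatype on both sides. */
int alltoallv_counts_match(int nprocs, const int *sendcounts,
                           const int *recvcounts)
{
    assert(nprocs > 0);
    for (int i = 0; i < nprocs; ++i)
        for (int j = 0; j < nprocs; ++j)
            if (sendcounts[i * nprocs + j] != recvcounts[j * nprocs + i])
                return 0;
    return 1;
}
```

In other words, the table of receive counts must be the transpose of the table of send counts. In a real program each process sees only its own row, so a check like this would need the rows to be collected first.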
C:
int MPI_Alltoallv(void* sendbuf, int* sendcounts, int* sdispls, MPI_Datatype sendtype, void* recvbuf, int* recvcounts, int* rdispls, MPI_Datatype recvtype, MPI_Comm comm)
FORTRAN:
MPI_ALLTOALLV(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm, ierr)
With the following variables:
| Argument | Description |
|---|---|
| sendbuf | the starting address of the send buffer (choice) |
| sendcounts | the integer array (of length group size) specifying the number of elements to send to each process |
| sdispls | the integer array (of length group size). Entry j specifies the displacement (relative to sendbuf) from which to take the outgoing data destined for process j |
| sendtype | the data type of the send buffer elements (handle) |
| recvbuf | the address of the receive buffer (choice) |
| recvcounts | the integer array (of length group size) specifying the number of elements that can be received from each process |
| rdispls | the integer array (of length group size). Entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i |
| recvtype | the data type of the receive buffer elements (handle) |
| comm | the communicator |
| ierr | the integer return error code (FORTRAN only) |
MPI_ALLTOALLW is an extension of MPI_ALLTOALLV defined in MPI-2. It allows the count, displacement, and datatype to be specified separately for each block. In addition, to allow maximum flexibility, the displacement of blocks within the send and receive buffers is specified in bytes. With carefully selected input arguments, MPI_ALLTOALLW generalizes several MPI functions. For example, making all but one process have sendcounts[i] = 0 achieves the effect of an "MPI_SCATTERW" function (the name is used only for illustration; MPI defines no such routine).
C:
int MPI_Alltoallw(void *sendbuf, int sendcounts[], int sdispls[], MPI_Datatype sendtypes[], void *recvbuf, int recvcounts[], int rdispls[], MPI_Datatype recvtypes[], MPI_Comm comm)
FORTRAN:
MPI_ALLTOALLW(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, rdispls, recvtypes, comm, ierr)
The variables are defined as:
| Argument | Description |
|---|---|
| sendbuf | the starting address of the send buffer (choice) |
| sendcounts | the integer array (of length group size) specifying the number of elements to send to each process |
| sdispls | the integer array (of length group size). Entry j specifies the displacement in bytes (relative to sendbuf) from which to take the outgoing data destined for process j |
| sendtypes | the array of datatypes (of length group size). Entry j specifies the type of data to send to process j (handle) |
| recvbuf | the address of the receive buffer (choice) |
| recvcounts | the integer array (of length group size) specifying the number of elements that can be received from each process |
| rdispls | the integer array (of length group size). Entry i specifies the displacement in bytes (relative to recvbuf) at which to place the incoming data from process i |
| recvtypes | the array of datatypes (of length group size). Entry i specifies the type of data received from process i (handle) |
| comm | the communicator (handle) |
C:
int MPI_Reduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, int root, MPI_Comm comm)
int MPI_Allreduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
int MPI_Reduce_scatter(void* sbuf, void* rbuf, int* rcounts, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
FORTRAN:
MPI_REDUCE(sbuf, rbuf, count, stype, op, root, comm, ierr)
MPI_ALLREDUCE(sbuf, rbuf, count, stype, op, comm, ierr)
MPI_REDUCE_SCATTER(sbuf, rbuf, rcounts, stype, op, comm, ierr)
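The differences among these three reduces:
MPI_REDUCE returns the result of the reduction to the root process only. MPI_ALLREDUCE returns the result to every process in the group. MPI_REDUCE_SCATTER scatters the vector of results across the processes, with process i receiving rcounts[i] elements.
The arguments are:
| Argument | Description |
|---|---|
| sbuf | the address of the send buffer |
| rbuf | the address of the receive buffer |
| count | the number of elements in the send buffer |
| rcounts | the integer array specifying the number of result elements distributed to each process (MPI_REDUCE_SCATTER only) |
| stype | the data type of the elements of the send buffer |
| op | the reduce operation (which may be MPI predefined, or your own) |
| root | the rank of the root process (MPI_REDUCE only) |
| comm | the group communicator |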
where * is the reduction function that may be either an MPI predefined
function or a user defined function.
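To make the three result placements concrete, the following is a minimal sketch in plain C that models an element-wise sum (the role MPI_SUM would play) across ranks. It makes no MPI calls, and the function names and int element type are assumptions for illustration.

```c
#include <assert.h>

/* Element-wise sum over all ranks. This is what the root holds after
 * MPI_Reduce, or what every rank holds after MPI_Allreduce, when the
 * operation is a sum. sbufs[r] is the send buffer of rank r. */
void model_reduce_sum(int nprocs, int count, int *sbufs[], int *result)
{
    assert(nprocs > 0 && count >= 0);
    for (int k = 0; k < count; ++k) {
        result[k] = 0;
        for (int r = 0; r < nprocs; ++r)
            result[k] += sbufs[r][k];
    }
}

/* Reduce_scatter model: the reduced vector is cut into consecutive
 * segments, and rank r keeps the segment of length rcounts[r]. */
void model_reduce_scatter_sum(int nprocs, const int *rcounts,
                              int *sbufs[], int *rbufs[])
{
    assert(nprocs > 0);
    int offset = 0;
    for (int r = 0; r < nprocs; ++r) {
        for (int k = 0; k < rcounts[r]; ++k) {
            int sum = 0;
            for (int q = 0; q < nprocs; ++q)
                sum += sbufs[q][offset + k];
            rbufs[r][k] = sum;
        }
        offset += rcounts[r];
    }
}
```

With two ranks contributing {1,2,3} and {10,20,30}, the reduced vector is {11,22,33}. With rcounts = {1,2}, the reduce-scatter model leaves {11} on rank 0 and {22,33} on rank 1.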
The syntax of the MPI scan routine is:
C:
int MPI_Scan(void* sbuf, void* rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
FORTRAN:
MPI_SCAN(sbuf, rbuf, count, datatype, op, comm, ierr)
where:
| Argument | Description |
|---|---|
| sbuf | the starting address of the send buffer |
| rbuf | the starting address of the receive buffer |
| count | the number of elements in the input buffer |
| datatype | the data type of the elements of the input buffer |
| op | the operation |
| comm | the group communicator |
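MPI_SCAN performs an inclusive prefix reduction: the process with rank i receives the reduction of the values contributed by ranks 0 through i. The following is a minimal sketch in plain C that models this for a sum; it makes no MPI calls, and the name model_scan_sum and the int element type are assumptions for illustration.

```c
#include <assert.h>

/* Inclusive prefix sum over ranks, element by element.
 * sbufs[r] is the input of rank r; rbufs[r] receives the sum of the
 * inputs of ranks 0..r (rank r itself included). */
void model_scan_sum(int nprocs, int count, int *sbufs[], int *rbufs[])
{
    assert(nprocs > 0 && count >= 0);
    for (int k = 0; k < count; ++k) {
        int running = 0;
        for (int r = 0; r < nprocs; ++r) {
            running += sbufs[r][k];
            rbufs[r][k] = running;
        }
    }
}
```

For three ranks contributing 1, 2 and 3, the ranks receive 1, 3 and 6 respectively, so the last rank ends up with the same value a full reduction would produce.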
Examples of MPI predefined operations are summarized below. MPI also provides a mechanism for user-defined operations used in MPI_ALLREDUCE and MPI_REDUCE.
| Name | Meaning | C type | FORTRAN type |
|---|---|---|---|
| MPI_MAX | maximum | integer, float | integer, real |
| MPI_MIN | minimum | integer, float | integer, real |
| MPI_SUM | sum | integer, float | integer, real, complex |
| MPI_PROD | product | integer, float | integer, real, complex |
| MPI_LAND | logical and | integer | logical |
| MPI_BAND | bit-wise and | integer, MPI_BYTE | integer, MPI_BYTE |
| MPI_LOR | logical or | integer | logical |
| MPI_BOR | bit-wise or | integer, MPI_BYTE | integer, MPI_BYTE |
| MPI_LXOR | logical xor | integer | logical |
| MPI_BXOR | bit-wise xor | integer, MPI_BYTE | integer, MPI_BYTE |
| MPI_MAXLOC | max value and location | combination of int, float, double, and long double | combination of integer, real, double precision |
| MPI_MINLOC | min value and location | combination of int, float, double, and long double | combination of integer, real, double precision |
INTEGER maxht, globmx
.
.
. (calculations which determine maximum height)
.
.
call MPI_REDUCE (maxht, globmx, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
IF (taskid.eq.0) then
.
. (Write output)
.
END IF
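The reduction above returns only the maximum height. MPI_MAXLOC from the table additionally reports where the maximum was found, by reducing (value, index) pairs; when two values tie, the smaller index wins. The following is a minimal sketch in plain C of that combining rule, using the rank as the index. It makes no MPI calls, and the type and function names are assumptions for illustration.

```c
#include <assert.h>

/* A (value, rank) pair, analogous in spirit to the pair types that
 * MPI_MAXLOC operates on. */
typedef struct {
    int value;
    int rank;
} value_rank;

/* Combining rule: larger value wins; on a tie, the lower rank wins. */
value_rank maxloc_combine(value_rank a, value_rank b)
{
    if (a.value > b.value) return a;
    if (b.value > a.value) return b;
    return (a.rank < b.rank) ? a : b;
}

/* Apply the rule across all ranks; values[r] is rank r's contribution. */
value_rank model_maxloc(int nprocs, const int *values)
{
    assert(nprocs > 0);
    value_rank best = { values[0], 0 };
    for (int r = 1; r < nprocs; ++r) {
        value_rank next = { values[r], r };
        best = maxloc_combine(best, next);
    }
    return best;
}
```

For contributions 3, 7, 7, 2 from ranks 0 to 3, the result is the value 7 at rank 1, the lower of the two ranks holding the maximum.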
Broadcast to 8 Processors
Solid line: data transfer
Dotted line: carry-over from previous transfer
N = number of processors
p = size of message
Number of steps: log2(N)
Amount of data transferred: (N-1)*p
Scatter to 8 Processors
Solid line: data transfer
Dotted line: carry-over from previous transfer
N = number of processors
p = size of message
Number of steps: log2(N)
Amount of data transferred: log2(N) * N * p/2
3. New Information from the Vendors
On the IBM SP, the new release of PSSP version 3.1.1 allows the processes of an MPI job running on the same node to take advantage of its shared memory architecture for communication. To use this feature, one needs to set the environment parameter MP_SHARED_MEMORY before launching the job, i.e.:
setenv MP_SHARED_MEMORY yes
IBM has started implementing their own non-blocking routines of collective communication to improve the performance. Their names are distinguished from the standard routines of collective communication by an additional letter "I" after the prefix "MPI_". For example, corresponding to MPI_Bcast(...) and MPI_Gather(...), the non-blocking routines are MPI_Ibcast(...) and MPI_Igather(...).
On the Origin 2000, SGI Message Passing Toolkit (MPT 1.3.0.3) has been installed with significant enhancement for the collective communication routines MPI_Barrier, MPI_Bcast, and MPI_Alltoall. MPT (1.3.0.3) supports both MPI and SHMEM message passing interface modules that FORTRAN programmers can use without any changes to portable MPI codes. To activate the compile-time interface checking, use the f90 compiler and the following command-line options:
-auto_use mpi_interface, shmem_interface
4. References
The material in this tutorial borrows heavily from copyrighted material developed at the Cornell Theory Center and is used with their permission.