MPI Foundations @ TACC

Contact: Victor Eijkhout



  • Socket: the processor chip
  • Processor: we don’t use that word
  • Core: one instruction-stream processing unit
  • Process: preferred terminology in talking about MPI.
  • SPMD model: Single Program, Multiple Data


prototype: declaration


  • Gathering
    • reduction: reduce, n to 1: MPI_Reduce
    • gather: collect, subsets into one set: MPI_Gather
  • Spreading
    • broadcast: identical, 1 to n: MPI_Bcast
    • scatter: subsetting: MPI_Scatter, MPI_Scatterv
int MPI_Reduce(
    void* send_data,       // send buffer, e.g. &x
    void* recv_data,
    int count,             // number of elements, e.g. size of x
    MPI_Datatype datatype,
    MPI_Op op,
    int root,              // no root parameter in MPI_Allreduce
    MPI_Comm communicator)

int MPI_Gather(
   void *sendbuf, int sendcnt, MPI_Datatype sendtype,
   void *recvbuf, int recvcnt, MPI_Datatype recvtype,
   int root, MPI_Comm comm)

int MPI_Scatter
   (void* sendbuf, int sendcount, MPI_Datatype sendtype,
    void* recvbuf, int recvcount, MPI_Datatype recvtype,
    int root, MPI_Comm comm)
  • MPI_MAX - Returns the maximum element.
  • MPI_MIN - Returns the minimum element.
  • MPI_SUM - Sums the elements.
  • MPI_PROD - Multiplies all elements.
  • MPI_LAND - Performs a logical and across the elements.
  • MPI_LOR - Performs a logical or across the elements.
  • MPI_BAND - Performs a bitwise and across the bits of the elements.
  • MPI_BOR - Performs a bitwise or across the bits of the elements.
  • MPI_MAXLOC - Returns the maximum value and the rank of the process that owns it.
  • MPI_MINLOC - Returns the minimum value and the rank of the process that owns it.


void pointer: memory address of the data


write &x or (void*) &x for a scalar


mpi4py: comm.recv handles arbitrary Python objects (pickled, slow); comm.Recv works on buffers such as NumPy arrays (fast)

Scan or 'parallel prefix'

Barrier: synchronize processes

  • almost never needed
  • only for timing

Naive realization

  • root sends to all: $p-1$ messages, each costing $\alpha + \beta n$
  • better: a tree of point-to-point messages

Distributed data

  • local view: (rank, local_index)
  • global view: global_index

Load balancing

$f(i) = \lfloor iN/p \rfloor$: proc $i$ owns the index range $[f(i), f(i+1))$

Local info. exchange

Matrices in parallel: distribute domain, not the matrix

p2p: ping-pong

  • two sides: A & B
  • match: send & recv

int MPI_Send(
  const void* buf, int count, MPI_Datatype datatype,
  int dest, int tag, MPI_Comm comm)

IN buf: initial address of send buffer (choice)
IN count: number of elements in send buffer (non-negative integer)
IN datatype: datatype of each send buffer element (handle)
IN dest: rank of destination (integer)
IN tag: message tag (integer)
IN comm: communicator (handle)

send & recv

communication across nodes is roughly 100× slower than within a node


send & recv are blocking operations

  • deadlock: e.g. if both sides send first
  • might work, depending on system buffering

Bypass blocking

  • odds & evens: exchange sort, compare-and-swap
  • pairwise exchange


  • MPI_Isend, MPI_Irecv: non-blocking send & receive

A non-blocking send or receive needs a matching MPI_Wait (which blocks) before the buffer may be reused; in blocking communication the buffer is safe as soon as the call returns. Latency hiding: the process need not sit idle until the communication is done. MPI_Test: a non-blocking alternative to waiting; if no message has come in yet, do local work and test again later.


  • MPI_Bsend, MPI_Ibsend: buffered send
  • MPI_Ssend, MPI_Issend: synchronous send
  • MPI_Rsend, MPI_Irsend: ready send
  • One-sided communication: ‘just’ put/get the data somewhere
  • Derived data types: send strided/irregular/inhomogeneous data
  • Sub-communicators: work with subsets of MPI_COMM_WORLD
  • I/O: efficient file operations
  • Non-blocking collectives
Published: Fri 14 April 2017. By Dongming Jin in
