Contact: Victor Eijkhout
Res:
Terminology
- Socket: the processor chip
- Processor: we don’t use that word
- Core: one instruction-stream processing unit
- Process: preferred terminology in talking about MPI.
- SPMD model:
single program, multiple data (sketch below)
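A minimal SPMD sketch: every process runs the same executable and branches on its rank.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  // same program everywhere; the rank decides what each process does
  if (rank == 0)
    printf("root of %d processes\n", nprocs);
  else
    printf("worker with rank %d\n", rank);
  MPI_Finalize();
  return 0;
}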
MPI
prototype
: declaration
Collectives
- Gathering
- reduction: reduce, n to 1
MPI_Reduce
- gather: collect subsets into one set
- Spreading
- broadcast: identical, 1 to n
- scatter: subsetting
MPI_Scatterv
int MPI_Reduce(
  void* send_data,        // the buffer, e.g. &x
  void* recv_data,
  int count,              // size of x
  MPI_Datatype datatype,
  MPI_Op op,
  int root,               // not needed for MPI_Allreduce
  MPI_Comm communicator)
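A minimal sketch of using the prototype above: every process contributes a scalar x, the root gets the sum (with MPI_Allreduce the root argument disappears and every rank gets the result). The local values are made up for illustration.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  double x = rank + 1.0, sum = 0.0;  // local contribution (example value)
  // &x and &sum fill the void* buffer arguments
  MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("sum = %g\n", sum);       // only the root has the result
  MPI_Finalize();
  return 0;
}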
int MPI_Gather(
void *sendbuf, int sendcnt, MPI_Datatype sendtype,
void *recvbuf, int recvcnt, MPI_Datatype recvtype,
int root, MPI_Comm comm)
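A sketch of a gather: each rank contributes one int, the root collects them into an array of length nprocs; the receive buffer only matters on the root. The contributed values are made up.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  int mine = rank * rank;            // one element per process (example value)
  int *all = NULL;
  if (rank == 0)                     // receive buffer only needed on the root
    all = malloc(nprocs * sizeof(int));
  MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    for (int i = 0; i < nprocs; i++)
      printf("from rank %d: %d\n", i, all[i]);
    free(all);
  }
  MPI_Finalize();
  return 0;
}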
int MPI_Scatter
(void* sendbuf, int sendcount, MPI_Datatype sendtype,
void* recvbuf, int recvcount, MPI_Datatype recvtype,
int root, MPI_Comm comm)
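A sketch of a scatter, the mirror image of the gather above: the root hands out one element of its array to every rank; MPI_Scatterv would allow unequal pieces.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  int *data = NULL;
  if (rank == 0) {                   // only the root fills the send buffer
    data = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; i++) data[i] = 10 * i;   // example values
  }
  int mine;                          // every rank receives one element
  MPI_Scatter(data, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
  printf("rank %d got %d\n", rank, mine);
  if (rank == 0) free(data);
  MPI_Finalize();
  return 0;
}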
- MPI_MAX - Returns the maximum element.
- MPI_MIN - Returns the minimum element.
- MPI_SUM - Sums the elements.
- MPI_PROD - Multiplies all elements.
- MPI_LAND - Performs a logical and across the elements.
- MPI_LOR - Performs a logical or across the elements.
- MPI_BAND - Performs a bitwise and across the bits of the elements.
- MPI_BOR - Performs a bitwise or across the bits of the elements.
- MPI_MAXLOC - Returns the maximum value and the rank of the process that owns it.
- MPI_MINLOC - Returns the minimum value and the rank of the process that owns it.
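MPI_MAXLOC/MPI_MINLOC reduce value/index pairs; a sketch using the predefined MPI_DOUBLE_INT pair type, with made-up local values:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  struct { double value; int rank; } in, out;   // layout matches MPI_DOUBLE_INT
  in.value = (rank * 37) % 11;                  // some local value (example)
  in.rank  = rank;
  MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("max %g found on rank %d\n", out.value, out.rank);
  MPI_Finalize();
  return 0;
}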
buffer
void pointer
: memory address of the data
C
: write &x or (void*)&x for a scalar
Python
: comm.recv: slow, but handles any Python object; comm.Recv: fast, buffer-based
Scan or 'parallel prefix'
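Sketch of an inclusive scan: rank i ends up with the sum of the contributions of ranks 0..i.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int x = rank + 1, prefix = 0;      // local contribution (example value)
  // inclusive prefix: rank i receives x_0 + ... + x_i
  MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("rank %d: prefix sum = %d\n", rank, prefix);
  MPI_Finalize();
  return 0;
}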
Barrier: Synchronize procs
- almost never needed
- only for timing
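Typical timing pattern: barrier, start the clock, do the work, barrier again so the slowest rank is included. The timed region is left as a placeholder here.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Barrier(MPI_COMM_WORLD);       // make sure everyone starts together
  double t = MPI_Wtime();
  /* ... work to be timed ... */
  MPI_Barrier(MPI_COMM_WORLD);       // wait for the slowest rank
  t = MPI_Wtime() - t;
  if (rank == 0)
    printf("elapsed: %g s\n", t);
  MPI_Finalize();
  return 0;
}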
Naive realization
- root sends to all: $\alpha + \beta n$ per message
- better realizations exist, still built from p2p messages
Distributed data
- each process has a local_index and its rank
- together these determine the global_index
Load balancing
$f(i) = \lfloor iN/p \rfloor$; proc $i$ owns indices $[f(i), f(i+1))$
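A sketch of this block distribution, with N chosen arbitrarily: process i owns global indices f(i) up to (not including) f(i+1), and a global index is the owner's first index plus the local index.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  long N = 1000;                          // global problem size (example value)
  long first = rank * N / nprocs;         // f(rank)   = floor(rank*N/p)
  long last  = (rank + 1) * N / nprocs;   // f(rank+1), exclusive upper bound
  long nlocal = last - first;
  // global index = first + local index
  printf("rank %d owns [%ld,%ld), %ld elements\n", rank, first, last, nlocal);
  MPI_Finalize();
  return 0;
}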
Local info. exchange
Matrices in parallel: distribute domain, not the matrix
p2p: ping-pong
- two sides: A & B
- match: send & recv
int MPI_Send(
const void* buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
Semantics:
IN buf: initial address of send buffer (choice)
IN count: number of elements in send buffer (non-negative integer)
IN datatype: datatype of each send buffer element (handle)
IN dest: rank of destination (integer)
IN tag: message tag (integer)
IN comm: communicator (handle)
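A ping-pong sketch assuming at least two processes: rank 0 is side A, rank 1 is side B; each send is matched by a receive with the same tag.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  double x = 3.14;                   // the "ball" (example value)
  int tag = 0;
  if (rank == 0) {                   // side A: ping, then wait for the pong
    MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("A got the ball back: %g\n", x);
  } else if (rank == 1) {            // side B: receive, then send back
    MPI_Recv(&x, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&x, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}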
Communication across nodes is roughly 100x slower than within a node.
Blocking
send & recv are blocking operations
- deadlock: both sides post the receive first, then the send
- might work: both sides post the send first, then the receive (depends on system buffering)
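The "might work" ordering as code, assuming exactly two processes: both ranks send first, which only succeeds while the message fits in system buffers; posting the receive first on both ranks would deadlock unconditionally.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int other = 1 - rank;              // assume exactly two processes
  double mine = rank, theirs;
  // UNSAFE ordering: both ranks send, then receive.
  // Works only while the message fits in system buffers.
  MPI_Send(&mine, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
  MPI_Recv(&theirs, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("rank %d received %g\n", rank, theirs);
  MPI_Finalize();
  return 0;
}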
Bypass blocking
- odds & evens: exchange sort, compare-and-swap
- pairwise exchange
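A safe pairwise exchange sketch, assuming an even number of processes: order the send and receive by rank parity, as in the odd/even exchange sort.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int other = rank ^ 1;              // partner: 0<->1, 2<->3, ... (assume even nprocs)
  double mine = rank, theirs;
  if (rank % 2 == 0) {               // even ranks: send first, then receive
    MPI_Send(&mine, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&theirs, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  } else {                           // odd ranks: receive first, then send
    MPI_Recv(&theirs, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&mine, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
  }
  printf("rank %d received %g\n", rank, theirs);
  MPI_Finalize();
  return 0;
}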
Non-blocking
MPI_Isend, MPI_Irecv
- need MPI_Wait (which blocks) before the buffer may be reused; in blocking communication the buffer is safe as soon as the call returns
Latency hiding
: no need to wait until the communication is done
MPI_Test
: non-blocking alternative to wait; do local work if no message has arrived yet
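A non-blocking sketch, again assuming an even number of processes: post MPI_Irecv/MPI_Isend, do local work while the messages are in flight, and wait on the requests before touching the buffers; MPI_Test would let you poll instead of waiting.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int other = rank ^ 1;              // partner, assuming an even number of processes
  double mine = rank, theirs;
  MPI_Request reqs[2];
  MPI_Irecv(&theirs, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(&mine, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);
  /* latency hiding: do local work here that does not touch the buffers */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // now the buffers are safe to reuse
  printf("rank %d received %g\n", rank, theirs);
  MPI_Finalize();
  return 0;
}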
More
- MPI_Bsend, MPI_Ibsend: buffered send
- MPI_Ssend, MPI_Issend: synchronous send
- MPI_Rsend, MPI_Irsend: ready send
- One-sided communication: ‘just’ put/get the data somewhere
- Derived data types: send strided/irregular/inhomogeneous data
- Sub-communicators: work with subsets of MPI_COMM_WORLD
- I/O: efficient file operations
- Non-blocking collectives
Comments!