MPI w/ DAPL user ** beware **
In our open-source finite element code, we have encountered a simple
manager-worker code section that fails randomly while moving arrays (blocks)
of double-precision data from worker ranks to manager rank 0.
The failures occur consistently with DAPL but never with TCP over InfiniBand (IPoIB).
After much effort, the culprit was found to be the memory registration
cache feature in DAPL.
This feature/bug is ON by default ** even though ** the manual states:
"The cache substantially increases performance, but may lead
to correctness issues in certain situations."
From: Intel® MPI Library for Linux OS Developer Reference (2017), p. 95.
Once we set this option OFF, the code runs successfully for all test cases on
both small and large numbers of cluster nodes, and DAPL performance is still
at least 2x better than IPoIB.
export I_MPI_DAPL_TRANSLATION_CACHE=0
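For a single job, the same setting can also be passed on the launch line with
Intel MPI's -genv option instead of exporting it in the shell (the rank count
and executable name below are placeholders, not our actual job):

mpirun -genv I_MPI_DAPL_TRANSLATION_CACHE 0 -np 56 ./solver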
Recommendation to the Intel MPI group:
Set I_MPI_DAPL_TRANSLATION_CACHE=0 as the DEFAULT. Encourage developers
to explore turning this option ON ** only if ** their code already works
properly with it OFF.
Specifics:
- Intel ifort 17.0.2
- Intel MPI 17.0.2
- Ohio Supercomputer Center, Owens Cluster:
  - RedHat 7.3
  - Mellanox EDR (100 Gbps) InfiniBand
  - Broadwell/Haswell cluster nodes
Code section that randomly fails:
-> Blocks are ALLOCATEd with variable sizes inside a Fortran
derived type (itself also allocated to the number of blocks).
All blocks on rank 0 are created before the code below is entered.
sync worker ranks to this point
if rank = 0 then
  loop sequentially over all blocks to be moved
    if rank 0 owns block -> next block
    send the worker that owns the block the block number (MPI_SEND)
    receive the block from that worker (MPI_RECV)
  end loop
  loop over all workers
    send block number = 0 to signal we are done moving blocks
  end loop
else ! worker code
  loop
    post MPI_RECV to get a block number
    if block number = 0 -> done
    if the worker does not own this block, the manager made an error !
    send rank 0 the entire block (MPI_SEND)
  end loop
end if
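
For reference, here is a minimal Fortran/MPI sketch of the same pattern. It is
not our production code: the type and variable names (block_t, blocks, owner,
blk_size), the message tags, and the setup comments are made up for
illustration; only the send/receive protocol matches the pseudocode above.

program move_blocks
  use mpi
  implicit none

  ! Hypothetical container for one variable-sized block of data
  type :: block_t
     double precision, allocatable :: a(:)
  end type block_t

  type(block_t), allocatable :: blocks(:)
  integer, allocatable :: owner(:), blk_size(:)  ! owning rank and length of each block
  integer :: ierr, myrank, nranks, nblocks, ib, w, stop_signal
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nranks, ierr)

  ! ... set nblocks, owner(:), blk_size(:) and allocate/fill blocks(:) here.
  !     On rank 0 every blocks(ib)%a is already allocated at this point. ...

  call MPI_BARRIER(MPI_COMM_WORLD, ierr)          ! sync worker ranks to this point

  if (myrank == 0) then
     do ib = 1, nblocks                           ! loop sequentially over blocks
        if (owner(ib) == 0) cycle                 ! rank 0 already owns this block
        call MPI_SEND(ib, 1, MPI_INTEGER, owner(ib), 1, MPI_COMM_WORLD, ierr)
        call MPI_RECV(blocks(ib)%a, blk_size(ib), MPI_DOUBLE_PRECISION, &
                      owner(ib), 2, MPI_COMM_WORLD, status, ierr)
     end do
     stop_signal = 0                              ! block number 0 means "done"
     do w = 1, nranks - 1
        call MPI_SEND(stop_signal, 1, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
     end do
  else                                            ! worker code
     do
        call MPI_RECV(ib, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, status, ierr)
        if (ib == 0) exit                         ! done moving blocks
        if (owner(ib) /= myrank) stop 'manager made an error'
        call MPI_SEND(blocks(ib)%a, blk_size(ib), MPI_DOUBLE_PRECISION, &
                      0, 2, MPI_COMM_WORLD, ierr)
     end do
  end if

  call MPI_FINALIZE(ierr)
end program move_blocks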