Channel: Clusters and HPC Technology

MPI_Comm_dup may hang with Intel MPI 4.1


Hi,

The attached program simple_repro.c reproduces what I believe is a bug in the Intel MPI implementation version 4.1.

In short, it spawns <num_threads> threads on each of 2 processes, such that thread i on rank 0 communicates with thread i on rank 1 over a communicator private to that pair. The only difference between the 2 processes is that the threads on rank 0 are gated by a semaphore initialized to <sem_value>, so at most <sem_value> of them can be active at the same time. Threads on rank 1 run freely.

The problem is that if the communication between a pair of threads involves creating a child communicator via MPI_Comm_dup(), the program is very likely to deadlock: <sem_value> pairs of threads are stuck in (comm_dup, comm_dup), and the remaining <num_threads> - <sem_value> pairs are stuck in (sem_wait, comm_dup), where each pair is listed as (rank 0, rank 1). See attached stack traces. This looks to me like a starvation problem: the rank 0 threads that hold the semaphore never finish their MPI_Comm_dup(), so the waiting threads never get to run.

$ mpigcc -mt_mpi -O3 -Dnum_threads=4 -Dnum_reps=10 -Dsem_value=1 simple_repro.c -o simple_repro
$ mpirun -n 2 `pwd`/simple_repro
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name          Pin cpu
[0] MPI startup(): 0       24808    localhost          {0,1,4,5}
[0] MPI startup(): 1       24809    localhost          {2,3,6,7}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 2
...HANGS...

The exact same program never hangs with Intel MPI 5.0, not even with high values for <num_threads> and <num_reps>.

Can anybody confirm this is a library issue that has been fixed in version 5.0?
Thank you!

 

