We have been experiencing hangs with our MPI-based application, and our investigation led us to the following behaviour of mpirun:
mpirun -n 1 -host <good_hostname> hostname works as expected
mpirun -n 1 -host <bad_hostname> hostname hangs. While it is hung, ps shows:
21465 pts/11 S+ 0:00 | | | \_ /bin/sh /opt/soft1/intel-mpi/impi/5.0.1.035/intel64/bin/mpirun -n 1 -host bad_hostname hostname
21470 pts/11 S+ 0:00 | | | \_ mpiexec.hydra -n 1 -host bad_hostname hostname
21471 pts/11 Z 0:00 | | | \_ [ssh] <defunct>
Once I press Enter on the terminal from which I ran the mpirun command, the command exits with no output and exit code 141:
$ mpirun -n 1 -host bad_hostname hostname; echo $?
141
$
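For what it's worth, 141 looks like the usual shell convention of 128 plus a signal number, which would make it signal 13 (SIGPIPE on Linux) and would match the SIGPIPE reported in the wait4() status in the strace output. A minimal sketch of how to decode such a status, assuming a POSIX shell on Linux (the variable names are just for illustration):

```shell
# Exit status observed after the hung mpirun returned
status=141

# Shells report 128 + N when the process was terminated by signal N
if [ "$status" -gt 128 ]; then
    signum=$((status - 128))
    # kill -l <number> prints the signal name without the SIG prefix
    signame=$(kill -l "$signum")
    echo "terminated by signal $signum (SIG$signame)"
else
    echo "exited normally with status $status"
fi
```

This only tells us which signal was involved, not which process raised it or why the pipe was closed.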
I tried running it under strace, and the command appears to get stuck in the following wait4() system call:
...
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0610228a10) = 19905
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e840, [], SA_RESTORER, 0x307a635cd0}, {SIG_DFL, [], SA_RESTORER, 0x307a635cd0}, 8) = 0
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL)
The full mpirun -v and strace output is attached.
I tried both Intel MPI 4.1.3.049 and 5.0.1.035, and the behaviour is the same.
Any help is much appreciated.