Channel: Clusters and HPC Technology

mpirun with bad hostname hangs with [ssh] until Enter is pressed


We have been experiencing hangs with our MPI-based application, and our investigation led us to the following behaviour of mpirun:

mpirun -n 1 -host <good_hostname> hostname works as expected

mpirun -n 1 -host <bad_hostname> hostname hangs, during which ps shows: 

21465 pts/11   S+     0:00  |   |   |   \_ /bin/sh /opt/soft1/intel-mpi/impi/5.0.1.035/intel64/bin/mpirun -n 1 -host bad_hostname hostname
21470 pts/11   S+     0:00  |   |   |       \_ mpiexec.hydra -n 1 -host bad_hostname hostname
21471 pts/11   Z      0:00  |   |   |           \_ [ssh] <defunct>

Once I press Enter on the terminal from which I ran the mpirun command, the command exits with no output and exit code 141:

$ mpirun -n 1 -host bad_hostname hostname; echo $?

141
$
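
For what it's worth, the exit code itself can be decoded. Assuming the usual shell convention that a status above 128 means "terminated by signal (status - 128)", 141 corresponds to signal 13, which is SIGPIPE on Linux. A minimal sketch of that arithmetic follows; decode_status is a hypothetical helper name used only for illustration, not part of mpirun or Intel MPI:

```shell
# Hypothetical helper: print the signal number encoded in a shell exit
# status, or print nothing if the status is not above 128.
# Assumes the common "128 + signal number" convention used by sh/bash.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    fi
}

sig=$(decode_status 141)
echo "exit status 141 -> signal number $sig"

# kill -l with a number asks the shell for the matching signal name.
kill -l "$sig"
```

This only explains what the number 141 encodes; it does not explain why the process waits for Enter before exiting.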

I ran it under strace, and the command appears to get stuck in the following wait4() system call:

...

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0610228a10) = 19905
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e840, [], SA_RESTORER, 0x307a635cd0}, {SIG_DFL, [], SA_RESTORER, 0x307a635cd0}, 8) = 0
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL)

 

The full mpirun -v and strace output is attached.
I tried this with both Intel MPI 4.1.3.049 and 5.0.1.035, and the behaviour is the same.

Any help is much appreciated.

Attached file    Size
mpirun_-v.txt    8.51 KB
strace.txt       18.68 KB
