Channel: Clusters and HPC Technology

MPI_Mprobe() makes no progress for internode communicator

Hi all,

My understanding (correct me if I'm wrong) is that MPI_Mprobe() must make progress, and eventually return, once a matching send has been posted. However, the attached minimal working example (mwe.c) runs to completion on a single Phi node of Stampede2 but deadlocks when run on more than one node.

Thanks,
Toby

impi version:
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)

mwe.c (attached)

slurm-mwe-stampede2-two-nodes.sh
~~~
#!/bin/sh
#SBATCH -J mwe # Job name
#SBATCH -p development # Queue (development or normal)
#SBATCH -N 2 # Number of nodes
#SBATCH --tasks-per-node 1 # Number of tasks per node
#SBATCH -t 00:01:00 # Time limit hrs:min:sec
#SBATCH -o mwe-%j.out # Standard output and error log
~~~

mwe-341107.out
~~~
TACC: Starting up job 341107
TACC: Starting parallel tasks...
[0]: post Isend
[1]: post Isend
slurmstepd: error: *** JOB 341107 ON c455-084 CANCELLED AT 2017-10-16T10:59:26 DUE TO TIME LIMIT ***
[mpiexec@c455-084.stampede2.tacc.utexas.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:857): assert (!closed) failed
[mpiexec@c455-084.stampede2.tacc.utexas.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c455-084.stampede2.tacc.utexas.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@c455-084.stampede2.tacc.utexas.edu] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
~~~

Attachment: mwe.c (text/x-csrc, 1.03 KB)
