Hello
I have a simple test MPI job that I'm having trouble running on IMPI 4.1.036 at large process counts (>1500 processes), using Hydra as the process manager. It gets stuck at the following point in the verbose debug output:
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 39): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 42): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 7): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 45): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): barrier_in
[proxy:0:160@cf-sb-cpc-223] forwarding command (cmd=barrier_in) upstream
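With a few thousand processes the verbose output is too long to read by eye, so one thing I could do is tally it per proxy. Below is a rough sketch of that in Python. The regular expressions are assumptions based only on the line format shown above, the function name summarize_barriers is made up for illustration, and I am treating the number after "from" simply as an opaque source id.

```python
import re
from collections import defaultdict

# Patterns assumed from the log excerpt above, e.g.
#   [proxy:0:160@cf-sb-cpc-223] got pmi command (from 39): barrier_in
#   [proxy:0:160@cf-sb-cpc-223] forwarding command (cmd=barrier_in) upstream
CMD_RE = re.compile(
    r"^\[proxy:(\d+):(\d+)@([^\]]+)\] got pmi command \(from (\d+)\): (\w+)"
)
FWD_RE = re.compile(
    r"^\[proxy:(\d+):(\d+)@([^\]]+)\] forwarding command \(cmd=(\w+)\) upstream"
)


def summarize_barriers(lines):
    """Map (proxy_id, host) to the source ids that sent barrier_in
    and whether that proxy logged forwarding barrier_in upstream."""
    summary = defaultdict(lambda: {"barrier_in": set(), "forwarded": False})
    for raw in lines:
        line = raw.strip()
        m = CMD_RE.match(line)
        if m:
            if m.group(5) == "barrier_in":
                key = (int(m.group(2)), m.group(3))
                summary[key]["barrier_in"].add(int(m.group(4)))
            continue
        m = FWD_RE.match(line)
        if m and m.group(4) == "barrier_in":
            key = (int(m.group(2)), m.group(3))
            summary[key]["forwarded"] = True
    return dict(summary)


def proxies_not_forwarded(summary):
    """Return the (proxy_id, host) keys that saw barrier_in but never forwarded it."""
    return sorted(k for k, v in summary.items() if not v["forwarded"])
```

Feeding the saved debug output through summarize_barriers (for example with open("hydra_debug.log") as the lines argument, where the file name is hypothetical) and then through proxies_not_forwarded should list any proxies still waiting on local processes, which would at least show whether the hang is on particular nodes or at the top-level barrier.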
The job runs consistently fine for process counts up to approximately 1536, but runs on 2k or 3k cores hang as above. I initially thought it might be ulimit related, but I have fixed those limits, as shown by the ulimit output from the same script:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515125
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 9316
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 515125
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
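One caveat is that this output comes from the shell running the job script, and I am not certain the processes started on the remote nodes see the same limits. A small sketch like the following, launched as the job itself instead of the MPI test, could confirm that from inside each process. It uses only Python's standard resource and socket modules; the choice of which limits to report and the function names are my own assumptions.

```python
import resource
import socket

# Limits that seem most relevant here; the labels mirror the ulimit flags above.
# Note: Python reports memory limits in bytes, whereas ulimit prints kbytes.
LIMITS = {
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "max locked memory (-l)": resource.RLIMIT_MEMLOCK,
    "stack size (-s)": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a limit value the way ulimit does for the unlimited case."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return one line per limit, tagged with this node's hostname."""
    host = socket.gethostname()
    rows = []
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        rows.append(f"{host} {label}: soft={fmt(soft)} hard={fmt(hard)}")
    return rows


# Printing the rows from every launched process makes mismatched nodes easy to spot.
for row in report_limits():
    print(row)
```

If any node reports a lower open files or max user processes value than the script's shell does, that would point back at limits after all.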
If I set I_MPI_PROCESS_MANAGER to mpd, the large jobs run faultlessly. So my question is: how should I go about debugging this further?
Many thanks
Ade