
Large jobs with Hydra won't start


Hello

I have a simple test MPI job that I'm having trouble running on IMPI 4.1.036 at large process counts (>1500 processes), using Hydra as the process manager; a minimal sketch of the kind of job and launch line involved is included after the trace below. With verbose debugging enabled, it gets stuck at the following place in the output:

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 39): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 42): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 7): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 45): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): barrier_in
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): get_maxes
[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): barrier_in
[proxy:0:160@cf-sb-cpc-223] forwarding command (cmd=barrier_in) upstream
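
For reference, the job being launched is nothing exotic; the sketch below shows the kind of program and launch line involved. The file name, rank count and hosts file are illustrative only, not the actual test source or script.

/* hello_barrier.c - minimal sketch of the kind of test job involved (illustrative only).
 * Launched along the lines of:
 *   mpirun -v -n 3072 -machinefile hosts ./hello_barrier
 * where -v gives the proxy-level PMI trace shown above.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, size = 0;

    /* The trace above stalls in the PMI barrier exchanged during start-up,
     * i.e. before MPI_Init() has returned on the affected ranks. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("All %d ranks initialised\n", size);

    MPI_Finalize();
    return 0;
}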

It runs consistently fine for process counts up to roughly 1536, but 2k or 3k cores break as above.  I initially thought it might be ulimit related, but those limits have been fixed, as shown by the output from the same script:

core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515125
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 9316
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 515125
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
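
For what it's worth, one way to double-check the limits as the MPI ranks themselves see them (rather than as the submitting shell sees them) is to run a small helper under the same launcher. A minimal sketch follows; check_limits.c and the launch line are my own illustrative naming, not part of the actual job script.

/* check_limits.c - hypothetical helper: print the limits each MPI rank actually inherits.
 * Run along the lines of:  mpirun -n 2048 ./check_limits | sort -u
 */
#define _GNU_SOURCE
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

static unsigned long long cur(int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0)
        return 0;
    return (unsigned long long)rl.rlim_cur;  /* RLIM_INFINITY shows up as a very large number */
}

int main(int argc, char **argv)
{
    char host[256] = "unknown";

    MPI_Init(&argc, &argv);
    gethostname(host, sizeof(host));

    /* One line per rank; piping through "sort -u" makes any node with lower limits stand out. */
    printf("host=%s nofile=%llu nproc=%llu memlock=%llu stack=%llu\n",
           host, cur(RLIMIT_NOFILE), cur(RLIMIT_NPROC),
           cur(RLIMIT_MEMLOCK), cur(RLIMIT_STACK));

    MPI_Finalize();
    return 0;
}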

If one sets I_MPI_PROCESS_MANAGER to mpd, the large jobs run faultlessly.  So my question is: how should I go about debugging this further?

Many thanks

Ade

