Large jobs with Hydra won't start

Hello

I have a simple test MPI job that I'm having trouble running on IMPI 4.1.036 at large process counts (>1500 processes), using Hydra as the process manager. It gets stuck at the following point in the verbose debug output:

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 39): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 42): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 7): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 45): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): barrier_in

[proxy:0:160@cf-sb-cpc-223] forwarding command (cmd=barrier_in) upstream

It runs consistently fine for process counts up to approximately 1536, but 2k or 3k cores break as above. I initially thought it might be ulimit related, but I have fixed those limits, as shown by the ulimit output from the same script:

core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515125
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 9316
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 515125
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
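
The ulimit output above comes from the shell running the job script. Processes started on the remote nodes may or may not inherit the same limits, so it may be worth checking them from inside a launched process as well. A minimal sketch, assuming Python 3 is available on the compute nodes; the script name check_limits.py and the choice of limits to print are my own, not part of the original test:

```python
import resource
import socket

# Limits to report; the selection here is an assumption about which ones
# are most likely to matter for a large launch.
LIMITS = {
    "open files": resource.RLIMIT_NOFILE,
    "max user processes": resource.RLIMIT_NPROC,
    "max locked memory": resource.RLIMIT_MEMLOCK,
    "stack size": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a limit value the way ulimit does for unlimited resources."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report():
    """Return one line per limit, tagged with the host name."""
    host = socket.gethostname()
    rows = []
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        rows.append(f"{host} {name}: soft={fmt(soft)} hard={fmt(hard)}")
    return rows


if __name__ == "__main__":
    print("\n".join(report()))
```

Launching this through the same mpirun command line as the failing job (one process per node is enough) would show whether every node really sees the limits listed above.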

If I set I_MPI_PROCESS_MANAGER to mpd, the large jobs run without problems. So my question is: how should I go about debugging this further?
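
One thing that might help narrow it down is tallying the verbose output per proxy, to see which proxies received barrier_in from their local processes but never forwarded it upstream. A rough sketch, assuming the log lines look exactly like the excerpt above; the function names and regular expressions are my own, and I am treating the number after "from" simply as an opaque identifier:

```python
import re
from collections import defaultdict

# Matches: [proxy:0:160@host] got pmi command (from 39): barrier_in
CMD_RE = re.compile(
    r"^\[proxy:(\d+):(\d+)@([^\]]+)\] got pmi command \(from (\d+)\): (\w+)"
)
# Matches: [proxy:0:160@host] forwarding command (cmd=barrier_in) upstream
FWD_RE = re.compile(
    r"^\[proxy:(\d+):(\d+)@([^\]]+)\] forwarding command \(cmd=barrier_in\) upstream"
)


def summarize(lines):
    """Collect barrier_in senders and upstream forwards per (proxy id, host)."""
    barrier = defaultdict(set)
    forwarded = set()
    for line in lines:
        m = CMD_RE.match(line)
        if m:
            if m.group(5) == "barrier_in":
                barrier[(int(m.group(2)), m.group(3))].add(int(m.group(4)))
            continue
        m = FWD_RE.match(line)
        if m:
            forwarded.add((int(m.group(2)), m.group(3)))
    return barrier, forwarded


def not_forwarded(barrier, forwarded):
    """Proxies that saw at least one barrier_in but never forwarded it."""
    return sorted(key for key in barrier if key not in forwarded)


# A few lines taken from the excerpt above, used as a smoke test.
SAMPLE = [
    "[proxy:0:160@cf-sb-cpc-223] got pmi command (from 39): barrier_in",
    "[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): get_maxes",
    "[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): barrier_in",
    "[proxy:0:160@cf-sb-cpc-223] forwarding command (cmd=barrier_in) upstream",
]

if __name__ == "__main__":
    barrier, forwarded = summarize(SAMPLE)
    for (proxy_id, host), senders in sorted(barrier.items()):
        state = "forwarded" if (proxy_id, host) in forwarded else "NOT forwarded"
        print(f"proxy {proxy_id} on {host}: {len(senders)} barrier_in, {state}")
    print("stalled proxies:", not_forwarded(barrier, forwarded))
```

Feeding the full verbose log through summarize() instead of SAMPLE would list any proxy that is still waiting on some of its local processes, which at least says whether the hang is on particular nodes or at the top-level mpiexec.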

Many thanks

Ade
