All,
(Note: I'm also asking this on the slurm-dev list.)
I'm hoping you can help me with a question. I'm on a cluster that uses SLURM. Let's say I ask for two 28-core Haswell nodes to run interactively, and I get them. Great, so my environment now has things like:
SLURM_NTASKS_PER_NODE=28
SLURM_TASKS_PER_NODE=28(x2)
SLURM_JOB_CPUS_PER_NODE=28(x2)
SLURM_CPUS_ON_NODE=28
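For anyone not used to the compact notation: 28(x2) means 28 tasks on each of 2 nodes. A minimal sketch that expands it into one count per node is below. The function name expand_tasks_per_node is made up for illustration (it is not a SLURM tool), and the 28(x2) fallback is only there so the snippet runs outside an allocation.

```shell
# Expand SLURM's compact per-node notation, e.g. "28(x2)" or "28(x2),20",
# into one task count per line (one line per node).
# expand_tasks_per_node is a hypothetical helper name, not part of SLURM.
expand_tasks_per_node() {
    echo "$1" | tr ',' '\n' | while IFS= read -r entry; do
        count=${entry%%\(*}              # task count before any "(xN)"
        case "$entry" in
            *"(x"*)
                reps=${entry##*x}        # leaves "N)"
                reps=${reps%\)}          # strip the trailing ")"
                ;;
            *)
                reps=1                   # no repeat suffix: a single node
                ;;
        esac
        i=0
        while [ "$i" -lt "$reps" ]; do
            echo "$count"
            i=$((i + 1))
        done
    done
}

# Falls back to the value from this post when run outside a SLURM job.
expand_tasks_per_node "${SLURM_TASKS_PER_NODE:-28(x2)}"
```

In my allocation this prints 28 twice, i.e. the scheduler is advertising 28 slots on each of the two nodes.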
Now, let's run a simple HelloWorld on, say, 48 processors (and pipe through sort to see things a bit better):
(1047) $ mpirun -np 48 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process 0 of 48 is on borgj102
Process 1 of 48 is on borgj102
Process 2 of 48 is on borgj102
Process 3 of 48 is on borgj102
Process 4 of 48 is on borgj102
Process 5 of 48 is on borgj102
Process 6 of 48 is on borgj102
Process 7 of 48 is on borgj102
Process 8 of 48 is on borgj102
Process 9 of 48 is on borgj102
Process 10 of 48 is on borgj102
Process 11 of 48 is on borgj102
Process 12 of 48 is on borgj102
Process 13 of 48 is on borgj102
Process 14 of 48 is on borgj102
Process 15 of 48 is on borgj102
Process 16 of 48 is on borgj102
Process 17 of 48 is on borgj102
Process 18 of 48 is on borgj102
Process 19 of 48 is on borgj102
Process 20 of 48 is on borgj102
Process 21 of 48 is on borgj102
Process 22 of 48 is on borgj102
Process 23 of 48 is on borgj102
Process 24 of 48 is on borgj102
Process 25 of 48 is on borgj102
Process 26 of 48 is on borgj102
Process 27 of 48 is on borgj102
Process 28 of 48 is on borgj105
Process 29 of 48 is on borgj105
Process 30 of 48 is on borgj105
Process 31 of 48 is on borgj105
Process 32 of 48 is on borgj105
Process 33 of 48 is on borgj105
Process 34 of 48 is on borgj105
Process 35 of 48 is on borgj105
Process 36 of 48 is on borgj105
Process 37 of 48 is on borgj105
Process 38 of 48 is on borgj105
Process 39 of 48 is on borgj105
Process 40 of 48 is on borgj105
Process 41 of 48 is on borgj105
Process 42 of 48 is on borgj105
Process 43 of 48 is on borgj105
Process 44 of 48 is on borgj105
Process 45 of 48 is on borgj105
Process 46 of 48 is on borgj105
Process 47 of 48 is on borgj105
As you can see, the first 28 processes are on node 1 and the last 20 are on node 2. Okay. Now I want to do some load balancing, so I want 24 on each node. In the past I always used -perhost for this and it worked, but now:
(1048) $ mpirun -np 48 -perhost 24 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process 0 of 48 is on borgj102
Process 1 of 48 is on borgj102
Process 2 of 48 is on borgj102
Process 3 of 48 is on borgj102
Process 4 of 48 is on borgj102
Process 5 of 48 is on borgj102
Process 6 of 48 is on borgj102
Process 7 of 48 is on borgj102
Process 8 of 48 is on borgj102
Process 9 of 48 is on borgj102
Process 10 of 48 is on borgj102
Process 11 of 48 is on borgj102
Process 12 of 48 is on borgj102
Process 13 of 48 is on borgj102
Process 14 of 48 is on borgj102
Process 15 of 48 is on borgj102
Process 16 of 48 is on borgj102
Process 17 of 48 is on borgj102
Process 18 of 48 is on borgj102
Process 19 of 48 is on borgj102
Process 20 of 48 is on borgj102
Process 21 of 48 is on borgj102
Process 22 of 48 is on borgj102
Process 23 of 48 is on borgj102
Process 24 of 48 is on borgj102
Process 25 of 48 is on borgj102
Process 26 of 48 is on borgj102
Process 27 of 48 is on borgj102
Process 28 of 48 is on borgj105
Process 29 of 48 is on borgj105
Process 30 of 48 is on borgj105
Process 31 of 48 is on borgj105
Process 32 of 48 is on borgj105
Process 33 of 48 is on borgj105
Process 34 of 48 is on borgj105
Process 35 of 48 is on borgj105
Process 36 of 48 is on borgj105
Process 37 of 48 is on borgj105
Process 38 of 48 is on borgj105
Process 39 of 48 is on borgj105
Process 40 of 48 is on borgj105
Process 41 of 48 is on borgj105
Process 42 of 48 is on borgj105
Process 43 of 48 is on borgj105
Process 44 of 48 is on borgj105
Process 45 of 48 is on borgj105
Process 46 of 48 is on borgj105
Process 47 of 48 is on borgj105
Huh. No change: still 28 and 20. Do you know if there is a way to override what appears to be SLURM taking precedence over the -perhost flag? I suppose there is that srun.slurm warning being thrown, but that warning usually relates to "tasks-per-core" sorts of manipulations.
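To check the split without counting lines by eye, something like the following tallies ranks per host. It is only a sketch: count_ranks_per_host is a name I made up, and it assumes the "Process N of M is on HOST" line format that my helloWorld.exe prints.

```shell
# Tally how many ranks landed on each host.
# count_ranks_per_host is a hypothetical helper; it assumes lines of the
# form "Process N of M is on HOST" and ignores everything else
# (rank maps, srun.slurm warnings, and so on).
count_ranks_per_host() {
    awk '/^Process [0-9]+ of [0-9]+ is on / { n[$NF]++ }
         END { for (h in n) print h, n[h] }' | sort
}

# Small self-contained example in the same format as the output above.
printf 'Process 0 of 3 is on borgj102\nProcess 1 of 3 is on borgj102\nProcess 2 of 3 is on borgj105\n' \
    | count_ranks_per_host
```

On the cluster it would be used as: mpirun -np 48 -perhost 24 ./helloWorld.exe | count_ranks_per_host. For the runs above it would report 28 for borgj102 and 20 for borgj105, where I am hoping for 24 and 24.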
Thanks,
Matt