Channel: Clusters and HPC Technology

Intel MPI, perhost, and SLURM: Can I override SLURM?


All,

(Note: I'm also asking this on the slurm-dev list.)

I'm hoping you can help me with a question. I'm on a cluster that uses SLURM. Let's say I ask for two 28-core Haswell nodes to run interactively, and I get them. Great, so my environment now has things like:

SLURM_NTASKS_PER_NODE=28
SLURM_TASKS_PER_NODE=28(x2)
SLURM_JOB_CPUS_PER_NODE=28(x2)
SLURM_CPUS_ON_NODE=28

Now, let's run a simple HelloWorld with, say, 48 processes (and pipe the output through sort to see things a bit better):

(1047) $ mpirun -np 48 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process    0 of   48 is on borgj102
Process    1 of   48 is on borgj102
Process    2 of   48 is on borgj102
Process    3 of   48 is on borgj102
Process    4 of   48 is on borgj102
Process    5 of   48 is on borgj102
Process    6 of   48 is on borgj102
Process    7 of   48 is on borgj102
Process    8 of   48 is on borgj102
Process    9 of   48 is on borgj102
Process   10 of   48 is on borgj102
Process   11 of   48 is on borgj102
Process   12 of   48 is on borgj102
Process   13 of   48 is on borgj102
Process   14 of   48 is on borgj102
Process   15 of   48 is on borgj102
Process   16 of   48 is on borgj102
Process   17 of   48 is on borgj102
Process   18 of   48 is on borgj102
Process   19 of   48 is on borgj102
Process   20 of   48 is on borgj102
Process   21 of   48 is on borgj102
Process   22 of   48 is on borgj102
Process   23 of   48 is on borgj102
Process   24 of   48 is on borgj102
Process   25 of   48 is on borgj102
Process   26 of   48 is on borgj102
Process   27 of   48 is on borgj102
Process   28 of   48 is on borgj105
Process   29 of   48 is on borgj105
Process   30 of   48 is on borgj105
Process   31 of   48 is on borgj105
Process   32 of   48 is on borgj105
Process   33 of   48 is on borgj105
Process   34 of   48 is on borgj105
Process   35 of   48 is on borgj105
Process   36 of   48 is on borgj105
Process   37 of   48 is on borgj105
Process   38 of   48 is on borgj105
Process   39 of   48 is on borgj105
Process   40 of   48 is on borgj105
Process   41 of   48 is on borgj105
Process   42 of   48 is on borgj105
Process   43 of   48 is on borgj105
Process   44 of   48 is on borgj105
Process   45 of   48 is on borgj105
Process   46 of   48 is on borgj105
Process   47 of   48 is on borgj105
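
The sorted listing above is long, so a per-node tally may be easier to compare between runs. This is only a small sketch: tally_ranks is a hypothetical helper name, and it assumes the program prints lines of the form "Process N of M is on HOST", as helloWorld.exe does here. Other lines (the srun.slurm warning and the rank map) are ignored.

```shell
# Hypothetical helper: count how many ranks reported each host.
# Reads lines of the form "Process N of M is on HOST" on stdin
# and prints one "HOST COUNT" line per host, sorted by host name.
tally_ranks() {
  awk '/^Process/ { count[$NF]++ }
       END { for (h in count) print h, count[h] }' | sort
}

# Small self-contained demonstration with three lines in the same format
printf '%s\n' \
  'Process    0 of   48 is on borgj102' \
  'Process   28 of   48 is on borgj105' \
  'Process    1 of   48 is on borgj102' | tally_ranks
```

Piping the mpirun command above into tally_ranks instead of sort should report 28 ranks for borgj102 and 20 for borgj105, matching the listing.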

As you can see, the first 28 processes are on node 1, and the last 20 are on node 2. Okay. Now, I want to do some load balancing, so I want 24 on each. In the past, I always used -perhost and it worked, but now:

(1048) $ mpirun -np 48 -perhost 24 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process    0 of   48 is on borgj102
Process    1 of   48 is on borgj102
Process    2 of   48 is on borgj102
Process    3 of   48 is on borgj102
Process    4 of   48 is on borgj102
Process    5 of   48 is on borgj102
Process    6 of   48 is on borgj102
Process    7 of   48 is on borgj102
Process    8 of   48 is on borgj102
Process    9 of   48 is on borgj102
Process   10 of   48 is on borgj102
Process   11 of   48 is on borgj102
Process   12 of   48 is on borgj102
Process   13 of   48 is on borgj102
Process   14 of   48 is on borgj102
Process   15 of   48 is on borgj102
Process   16 of   48 is on borgj102
Process   17 of   48 is on borgj102
Process   18 of   48 is on borgj102
Process   19 of   48 is on borgj102
Process   20 of   48 is on borgj102
Process   21 of   48 is on borgj102
Process   22 of   48 is on borgj102
Process   23 of   48 is on borgj102
Process   24 of   48 is on borgj102
Process   25 of   48 is on borgj102
Process   26 of   48 is on borgj102
Process   27 of   48 is on borgj102
Process   28 of   48 is on borgj105
Process   29 of   48 is on borgj105
Process   30 of   48 is on borgj105
Process   31 of   48 is on borgj105
Process   32 of   48 is on borgj105
Process   33 of   48 is on borgj105
Process   34 of   48 is on borgj105
Process   35 of   48 is on borgj105
Process   36 of   48 is on borgj105
Process   37 of   48 is on borgj105
Process   38 of   48 is on borgj105
Process   39 of   48 is on borgj105
Process   40 of   48 is on borgj105
Process   41 of   48 is on borgj105
Process   42 of   48 is on borgj105
Process   43 of   48 is on borgj105
Process   44 of   48 is on borgj105
Process   45 of   48 is on borgj105
Process   46 of   48 is on borgj105
Process   47 of   48 is on borgj105

Huh. No change: still 28 on the first node and 20 on the second. Do you know if there is a way to "override" what appears to be SLURM taking precedence over the -perhost flag? There is that srun.slurm warning being thrown, but that warning usually concerns "tasks-per-core" sorts of manipulations.
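
One avenue that might be relevant, offered as an untested sketch rather than a confirmed fix: the Intel MPI Hydra documentation describes an environment variable, I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, that controls whether the scheduler's per-node placement is used instead of an explicit -perhost value. Whether it applies to this Intel MPI version and this cluster's SLURM setup is an assumption.

```shell
# Assumption: this Intel MPI release honors I_MPI_JOB_RESPECT_PROCESS_PLACEMENT.
# Setting it to 0 asks Hydra not to take the per-node placement from the
# job scheduler, so that -perhost can be applied instead.
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0

# Then rerun the same command inside the allocation (left commented out here,
# since it only makes sense on the cluster):
# mpirun -np 48 -perhost 24 -print-rank-map ./helloWorld.exe | sort -k2 -g
```

If the variable takes effect, the rank map should show 24 ranks on each node rather than 28 and 20.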

Thanks,

Matt

