SLURM and I_MPI_JOB_RESPECT_PROCESS_PLACEMENT

I was having issues with Intel MPI 5.x (5.2.1 and older) not respecting -ppn or -perhost. Searching this forum, I found this post:

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...

So the default behavior is to ignore -ppn. I have 2 nodes, ml036 and ml311. My SLURM_NODELIST is:

SLURM_JOB_NODELIST=ml[036,311]

Without setting I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, I see -ppn ignored:

[green@ml036 ~]$ mpirun -n 2 -ppn 1 ./hello_mpi
hello_parallel.f: Number of tasks=  2 My rank=  0 My name=ml036.localdomain
hello_parallel.f: Number of tasks=  2 My rank=  1 My name=ml036.localdomain
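
(For reference, hello_mpi is essentially the usual Fortran MPI hello world; roughly this sketch, shown free-form here for readability:)

program hello_parallel
  use mpi
  implicit none
  integer :: ierr, rank, ntasks, namelen
  character(len=MPI_MAX_PROCESSOR_NAME) :: pname

  ! print the total rank count and the host each rank landed on
  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Get_processor_name(pname, namelen, ierr)
  print *, 'hello_parallel.f: Number of tasks=', ntasks, ' My rank=', rank, ' My name=', trim(pname)
  call MPI_Finalize(ierr)
end program hello_parallel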

Following that previous post, I set

setenv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT disable

and then -ppn works as expected:

[green@ml036 ~]$ setenv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT disable
[green@ml036 ~]$ mpirun -n 2 -ppn 1 ./hello_mpi
hello_parallel.f: Number of tasks=  2 My rank=  0 My name=ml036.localdomain
hello_parallel.f: Number of tasks=  2 My rank=  1 My name=ml311.localdomain
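
As an aside, I assume the same thing can be done per run with mpirun's -genv option instead of setting it in the shell:

mpirun -genv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT disable -n 2 -ppn 1 ./hello_mpi

though I have only tested the setenv route.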

So is this a local configuration issue? It's easy enough to set the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT environment variable, but I'm curious what it does and why I have to set it manually. Shouldn't Intel MPI figure out that I'm on a SLURM system and 'automatically' do the right thing without this variable?

With I_MPI_DEBUG=6 set, and without I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, I got this:

$ mpirun -n 2 -ppn 1 ./hello_mpi
[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 1  Build 20161016 (id: 16418)
[0] MPI startup(): Copyright (C) 2003-2016 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 2: 0-0 & 0-2147483647
[0] MPI startup(): Allgather: 3: 1-256 & 0-2147483647
[0] MPI startup(): Allgather: 1: 257-2147483647 & 0-2147483647
[0] MPI startup(): Allgather: 3: 257-5851 & 0-2147483647
[0] MPI startup(): Allgather: 1: 5852-57344 & 0-2147483647
[0] MPI startup(): Allgather: 3: 57345-388846 & 0-2147483647
[0] MPI startup(): Allgather: 1: 388847-1453707 & 0-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 0-1901 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 1902-2071 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 2072-32768 & 0-2147483647
[0] MPI startup(): Allreduce: 8: 32769-65536 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 65537-131072 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 131073-524288 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 524289-1048576 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-131072 & 0-2147483647
[0] MPI startup(): Alltoall: 4: 131073-529941 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 529942-1756892 & 0-2147483647
[0] MPI startup(): Alltoall: 4: 1756893-2097152 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 1: 0-0 & 0-2147483647
[0] MPI startup(): Bcast: 8: 1-3938 & 0-2147483647
[0] MPI startup(): Bcast: 1: 3939-4274 & 0-2147483647
[0] MPI startup(): Bcast: 8: 4275-12288 & 0-2147483647
[0] MPI startup(): Bcast: 3: 12289-36805 & 0-2147483647
[0] MPI startup(): Bcast: 7: 36806-95325 & 0-2147483647
[0] MPI startup(): Bcast: 1: 95326-158190 & 0-2147483647
[0] MPI startup(): Bcast: 7: 158191-2393015 & 0-2147483647
[0] MPI startup(): Bcast: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 0-874 & 0-2147483647
[0] MPI startup(): Gather: 1: 875-2048 & 0-2147483647
[0] MPI startup(): Gather: 3: 2049-4096 & 0-2147483647
[0] MPI startup(): Gather: 1: 4097-65536 & 0-2147483647
[0] MPI startup(): Gather: 3: 65537-297096 & 0-2147483647
[0] MPI startup(): Gather: 1: 297097-524288 & 0-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 0-6 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 0-0 & 0-2147483647
[0] MPI startup(): Scatter: 1: 1-48 & 0-2147483647
[0] MPI startup(): Scatter: 3: 49-91 & 0-2147483647
[0] MPI startup(): Scatter: 0: 92-201 & 0-2147483647
[0] MPI startup(): Scatter: 3: 202-2048 & 0-2147483647
[0] MPI startup(): Scatter: 1: 2049-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 2049-4751 & 0-2147483647
[0] MPI startup(): Scatter: 0: 4752-12719 & 0-2147483647
[0] MPI startup(): Scatter: 3: 12720-20604 & 0-2147483647
[0] MPI startup(): Scatter: 0: 20605-32768 & 0-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Rank    Pid      Node name          Pin cpu
[0] MPI startup(): 0       99166    ml036.localdomain  {0,1,2,3,4,5,6,7}
[0] MPI startup(): 1       99167    ml036.localdomain  {8,9,10,11,12,13,14,15}
[0] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[1] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=6
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=qib0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
hello_parallel.f: Number of tasks=  2 My rank=  1 My name=ml036.localdomain
hello_parallel.f: Number of tasks=  2 My rank=  0 My name=ml036.localdomain

 

SLURM variables (this listing is from a separate allocation, on nodes ml015 and ml017; a batch-script sketch follows it):

[green@ml015 ~]$ env | grep SLURM
SLURM_NTASKS_PER_NODE=16
SLURM_SUBMIT_DIR=/users/green
SLURM_JOB_ID=534349
SLURM_JOB_NUM_NODES=2
SLURM_JOB_NODELIST=ml[015,017]
SLURM_JOB_CPUS_PER_NODE=16(x2)
SLURM_JOBID=534349
SLURM_NNODES=2
SLURM_NODELIST=ml[015,017]
SLURM_TASKS_PER_NODE=16(x2)
SLURM_NTASKS=32
SLURM_NPROCS=32
SLURM_PRIO_PROCESS=0
SLURM_DISTRIBUTION=cyclic
SLURM_STEPID=0
SLURM_SRUN_COMM_PORT=41294
SLURM_PTY_PORT=43155
SLURM_PTY_WIN_COL=143
SLURM_PTY_WIN_ROW=33
SLURM_STEP_ID=0
SLURM_STEP_NODELIST=ml015
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=41294
SLURM_SRUN_COMM_HOST=192.168.0.153
SLURM_TOPOLOGY_ADDR=ml015
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_TASK_PID=129118
SLURM_CPUS_ON_NODE=16
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=192.168.0.153
SLURM_GTIDS=0
SLURM_CHECKPOINT_IMAGE_DIR=/users/green
SLURMD_NODENAME=ml015
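
For completeness, a minimal sketch of how I would set this in a batch job (assuming tcsh since that is my login shell; the script is hypothetical, and the resource requests just mirror the allocation above):

#!/bin/tcsh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# disable scheduler-provided placement so that -ppn takes effect (see above)
setenv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT disable
mpirun -n 2 -ppn 1 ./hello_mpi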

