Hi,
Intel Parallel Studio Cluster Edition, 2017 Update 5, on CentOS 7.3
I am trying to run a hybrid parallel NWChem job with 2 ranks per 24-core node and 12 threads per rank. The underlying ARMCI library seems to expect consecutive ranks to reside on the same node, i.e., ranks 0 and 1 on node 1, ranks 2 and 3 on node 2, etc. With a simple "mpirun ... -perhost 2" across 4 nodes, I get plain round-robin assignment instead of the documented group round-robin (consecutive processes on each host) assignment:
[cchang@login1 03:38:06 /scratch/cchang/C6H6_CCSD_NWC]$ mpirun -h | grep perhost
-perhost place consecutive processes on each host
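For context, the job was launched essentially as follows (reconstructed; the executable, input file, and nodefile names are placeholders, and only -perhost 2 plus the 8-rank/12-thread layout matter):

# Reconstructed launch line; binary, input, and nodefile names are illustrative
export I_MPI_DEBUG=4          # level 4 prints the pinning table shown below
export OMP_NUM_THREADS=12     # 12 threads per rank
mpirun -n 8 -perhost 2 -f nodefile ./nwchem input.nw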
[0] MPI startup(): Rank  Pid    Node name  Pin cpu
[0] MPI startup(): 0     16252  n1757      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 1     3900   n1756      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 2     28323  n1738      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 3     13358  n1733      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 4     16253  n1757      {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 5     3901   n1756      {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 6     28324  n1738      {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 7     13359  n1733      {6,7,8,9,10,11,18,19,20,21,22,23}
If I instead try I_MPI_PIN_DOMAIN, or use a hexadecimal binding map in the nodefile, all ranks end up on the same node:
[cchang@login1 03:31:24 /scratch/cchang/C6H6_CCSD_NWC]$ cat nodefile
n2123:2 binding=map=[03F03F,FC0FC0]
n1942:2 binding=map=[03F03F,FC0FC0]
n1915:2 binding=map=[03F03F,FC0FC0]
n1876:2 binding=map=[03F03F,FC0FC0]
[cchang@login1 03:31:27 /scratch/cchang/C6H6_CCSD_NWC]$ head -20 proc8.log
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[5] MPI startup(): shm data transfer mode
[6] MPI startup(): shm data transfer mode
[7] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank  Pid   Node name  Pin cpu
[0] MPI startup(): 0     8510  n2123      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 1     8511  n2123      {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 2     8512  n2123      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 3     8513  n2123      {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 4     8514  n2123      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 5     8515  n2123      {6,7,8,9,10,11,18,19,20,21,22,23}
[0] MPI startup(): 6     8516  n2123      {0,1,2,3,4,5,12,13,14,15,16,17}
[0] MPI startup(): 7     8517  n2123      {6,7,8,9,10,11,18,19,20,21,22,23}
...
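The I_MPI_PIN_DOMAIN attempt was along these lines (reconstructed for illustration; the domain value and launch arguments are my assumptions, not a verbatim copy of the run):

# Reconstructed I_MPI_PIN_DOMAIN attempt; values are illustrative
export I_MPI_PIN_DOMAIN=socket    # one pinning domain (12 cores) per socket
export OMP_NUM_THREADS=12
mpirun -n 8 -perhost 2 -f nodefile ./nwchem input.nw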
What is Intel's preferred mechanism for placing consecutive ranks in pairs on each node, with each multi-threaded rank bound to its own socket?
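In other words, the target layout is (assuming the {0-5,12-17} and {6-11,18-23} sets above correspond to the two sockets, as intended by the hex maps in the nodefile):

node 1: rank 0 -> socket 0, rank 1 -> socket 1
node 2: rank 2 -> socket 0, rank 3 -> socket 1
node 3: rank 4 -> socket 0, rank 5 -> socket 1
node 4: rank 6 -> socket 0, rank 7 -> socket 1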
Thanks; Chris