Dear expert,
I am looking for confirmation that I am doing this properly. Here is my situation. The new cluster at my institution has two Mellanox Connect-IB cards in each node. Each node is a dual-socket, six-core Ivy Bridge system. The node architecture is such that each socket is connected by a direct PCIe link to one of the two IB cards. What I want to do is basically assign a subset of the MPI processes (e.g. the first 6) to the first IB card and the other MPI processes to the second IB card. No rail sharing: for both small and large messages, each MPI process should use the single IB card assigned to it.
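For reference, one way I double-check which socket each card hangs off (assuming the standard mlx5 sysfs layout) is to read the NUMA node reported for each device:

# NUMA node (i.e. socket) each Connect-IB card is attached to; 0 and 1 would confirm one card per socket
cat /sys/class/infiniband/mlx5_0/device/numa_node
cat /sys/class/infiniband/mlx5_1/device/numa_node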
Here is what I did...
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_NUM_ADAPTERS=2
export I_MPI_OFA_ADAPTER_NAME=mlx5_0,mlx5_1
export I_MPI_OFA_RAIL_SCHEDULER=PROCESS_BIND
export I_MPI_PIN_DOMAIN=core
export I_MPI_PIN_ORDER=scatter
export I_MPI_DEBUG=6
mpirun -genvall -print-rank-map -np 24 -ppn 12 ./run_dual_bind <exe>
The "run_dual_bind" script contains...
#!/bin/bash
# Map the global MPI rank to a node-local rank (12 ranks per node).
lrank=$(($PMI_RANK % 12))
case ${lrank} in
0|1|2|3|4|5)
# Ranks on the first socket: first GPU and first IB card.
export CUDA_VISIBLE_DEVICES=0
export I_MPI_OFA_NUM_ADAPTERS=1
export I_MPI_OFA_ADAPTER_NAME=mlx5_0
exec "$@"
;;
6|7|8|9|10|11)
# Ranks on the second socket: second IB card.
export I_MPI_OFA_NUM_ADAPTERS=1
export I_MPI_OFA_ADAPTER_NAME=mlx5_1
exec "$@"
;;
esac
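As a quick sanity check that each rank really picks up its own adapter name (this only verifies the environment, not the actual traffic, and it relies on PMI_RANK being set by the launcher, just like the wrapper itself), I can run the wrapper around a dummy command instead of the real executable:

mpirun -genvall -np 24 -ppn 12 ./run_dual_bind bash -c 'echo "rank $PMI_RANK on $(hostname) -> $I_MPI_OFA_ADAPTER_NAME"'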
In theory it should work. I can verify the MPI process binding by looking at the mpirun output, but I have no idea whether the interconnect I want each process to use is actually the one being used.
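I guess I could also sample the port traffic counters on both cards before and after a run; if the binding works, each card should essentially only carry the traffic of its own six ranks per node. Something along these lines, assuming port 1 and the standard counters under sysfs:

# transmit counter (in 4-byte words) for port 1 of each HCA; read before and after the run
for hca in mlx5_0 mlx5_1; do
echo "$hca: $(cat /sys/class/infiniband/$hca/ports/1/counters/port_xmit_data)"
done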
Am I doing this properly? Is this the right way to realize fine-grained rail binding?
Many thanks in advance. I also take the opportunity to wish everybody a Happy New Year!
Filippo