Hi guys,
I have a SLURM cluster set up with Intel MPI and Ansys CFX.
Here are my settings for the jobs:
export I_MPI_DEBUG=5
export PSM_SHAREDCONTEXTS=1
export PSM_RANKS_PER_CONTEXT=4
export TMI_CONFIG=/etc/tmi.conf
export IPATH_NO_CPUAFFINITY=1
export I_MPI_DEVICE=rddsm
export I_MPI_FALLBACK_DEVICE=disable
export I_MPI_PLATFORM=bdw
export SLURM_CPU_BIND=none
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm
export I_MPI_FALLBACK=1
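For context, this is roughly what the job script looks like; a minimal sketch only (the node counts, module name, and the IMB-MPI1 fabric test at the end are placeholders I use to test the fabric independently of CFX, the real job launches the CFX solver):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive

module load intelmpi/5.0.3   # placeholder module name

export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm
export TMI_CONFIG=/etc/tmi.conf
export SLURM_CPU_BIND=none

# quick fabric sanity check independent of CFX
# (IMB-MPI1 from the Intel MPI Benchmarks, assuming it is in the PATH)
mpirun -np $SLURM_NTASKS IMB-MPI1 PingPong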
I also have the Intel MPI 5.0.3 module loaded, under CentOS 7.
The simulation starts, but the traffic does not go through the ib0 interfaces.
This is the output with I_MPI_DEBUG=5:
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm and tmi data transfer modes
[8] MPI startup(): shm and tmi data transfer modes
[2] MPI startup(): shm and tmi data transfer modes
[10] MPI startup(): shm and tmi data transfer modes
[4] MPI startup(): shm and tmi data transfer modes
[12] MPI startup(): shm and tmi data transfer modes
[1] MPI startup(): shm and tmi data transfer modes
[9] MPI startup(): shm and tmi data transfer modes
[3] MPI startup(): shm and tmi data transfer modes
[15] MPI startup(): shm and tmi data transfer modes
[6] MPI startup(): shm and tmi data transfer modes
[14] MPI startup(): shm and tmi data transfer modes
[5] MPI startup(): shm and tmi data transfer modes
[11] MPI startup(): shm and tmi data transfer modes
[7] MPI startup(): shm and tmi data transfer modes
[13] MPI startup(): shm and tmi data transfer modes
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 12614 qingclinf-01.hpc.cluster {0,1,2,20,21}
[0] MPI startup(): 1 12615 qingclinf-01.hpc.cluster {3,4,22,23,24}
[0] MPI startup(): 2 12616 qingclinf-01.hpc.cluster {5,6,7,25,26}
[0] MPI startup(): 3 12617 qingclinf-01.hpc.cluster {8,9,27,28,29}
[0] MPI startup(): 4 12618 qingclinf-01.hpc.cluster {10,11,12,30,31}
[0] MPI startup(): 5 12619 qingclinf-01.hpc.cluster {13,14,32,33,34}
[0] MPI startup(): 6 12620 qingclinf-01.hpc.cluster {15,16,17,35,36}
[0] MPI startup(): 7 12621 qingclinf-01.hpc.cluster {18,19,37,38,39}
[0] MPI startup(): 8 12441 qingclinf-02.hpc.cluster {0,1,2,20,21}
[0] MPI startup(): 9 12442 qingclinf-02.hpc.cluster {3,4,22,23,24}
[0] MPI startup(): 10 12443 qingclinf-02.hpc.cluster {5,6,7,25,26}
[0] MPI startup(): 11 12444 qingclinf-02.hpc.cluster {8,9,27,28,29}
[0] MPI startup(): 12 12445 qingclinf-02.hpc.cluster {10,11,12,30,31}
[0] MPI startup(): 13 12446 qingclinf-02.hpc.cluster {13,14,32,33,34}
[0] MPI startup(): 14 12447 qingclinf-02.hpc.cluster {15,16,17,35,36}
[0] MPI startup(): 15 12448 qingclinf-02.hpc.cluster {18,19,37,38,39}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tmi
[0] MPI startup(): I_MPI_FALLBACK=1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,21,21,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=qib0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=8:0 0,1 3,2 5,3 8,4 10,5 13,6 15,7 18
[0] MPI startup(): I_MPI_PLATFORM=auto
[0] MPI startup(): I_MPI_TMI_PROVIDER=psm
But there is no traffic over InfiniBand. This is the ifconfig output for ib0 on one of the compute nodes:
inet 10.0.2.1 netmask 255.255.255.0 broadcast 10.0.2.255
inet6 fe80::211:7500:6e:de10 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 121 bytes 23835 (23.2 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 118 bytes 22643 (22.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
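Should I instead be looking at the counters on the qib0 port itself, since PSM may bypass IPoIB and never touch ib0? For example (assuming port 1 and the standard sysfs layout, and infiniband-diags for perfquery):

cat /sys/class/infiniband/qib0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/qib0/ports/1/counters/port_rcv_data
# or, with infiniband-diags installed:
perfquery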
I did chmod 666 on /dev/ipath and /dev/infiniband* on the compute nodes.
The /etc/tmi.conf file has the entry for the psm library.
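The psm line looks roughly like this (version number and library path are from memory and may differ on my install; the format is provider, version, library, arguments):

psm 1.2 libtmip_psm.so " "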
Why do the simulations run fine but not go over InfiniBand? I can ping and ssh over InfiniBand, but I cannot get MPI to use it.
Thanks in advance.