SLURM + Intel MPI cannot use InfiniBand, only Ethernet (QLogic/Intel switch and interfaces)


Hi guys,

I have a SLURM cluster set up with Intel MPI and Ansys CFX.

Here are my settings for the jobs:

export I_MPI_DEBUG=5
export PSM_SHAREDCONTEXTS=1
export PSM_RANKS_PER_CONTEXT=4
export TMI_CONFIG=/etc/tmi.conf
export IPATH_NO_CPUAFFINITY=1
export I_MPI_DEVICE=rddsm
export I_MPI_FALLBACK_DEVICE=disable
export I_MPI_PLATFORM=bdw
export SLURM_CPU_BIND=none
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm
export I_MPI_FALLBACK=1
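
If I understand the docs correctly, I_MPI_DEVICE and I_MPI_FALLBACK_DEVICE are the older Intel MPI 3.x/4.x-style controls, superseded in 5.x by I_MPI_FABRICS and I_MPI_FALLBACK (and "rddsm" looks like a typo for "rdssm", though the debug output below shows I_MPI_FABRICS took effect anyway). More importantly, I_MPI_FALLBACK=1 allows a silent fallback to shm:tcp if PSM cannot be initialized. A minimal sketch of the fabric-related part of the environment that fails loudly instead of dropping to TCP:

# Sketch: minimal fabric selection for a QLogic/True Scale (PSM) fabric
# with Intel MPI 5.x; not my verified working configuration.
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm
export TMI_CONFIG=/etc/tmi.conf
export I_MPI_FALLBACK=0   # abort instead of silently falling back to TCP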

I also have the intelmpi 5.0.3 module loaded, under CentOS 7.

The simulation starts, but the traffic does not go through the ib0 interfaces.

This is the debug output:

[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm and tmi data transfer modes
[8] MPI startup(): shm and tmi data transfer modes
[2] MPI startup(): shm and tmi data transfer modes
[10] MPI startup(): shm and tmi data transfer modes
[4] MPI startup(): shm and tmi data transfer modes
[12] MPI startup(): shm and tmi data transfer modes
[1] MPI startup(): shm and tmi data transfer modes
[9] MPI startup(): shm and tmi data transfer modes
[3] MPI startup(): shm and tmi data transfer modes
[15] MPI startup(): shm and tmi data transfer modes
[6] MPI startup(): shm and tmi data transfer modes
[14] MPI startup(): shm and tmi data transfer modes
[5] MPI startup(): shm and tmi data transfer modes
[11] MPI startup(): shm and tmi data transfer modes
[7] MPI startup(): shm and tmi data transfer modes
[13] MPI startup(): shm and tmi data transfer modes
[0] MPI startup(): Rank    Pid      Node name                 Pin cpu
[0] MPI startup(): 0       12614    qingclinf-01.hpc.cluster  {0,1,2,20,21}
[0] MPI startup(): 1       12615    qingclinf-01.hpc.cluster  {3,4,22,23,24}
[0] MPI startup(): 2       12616    qingclinf-01.hpc.cluster  {5,6,7,25,26}
[0] MPI startup(): 3       12617    qingclinf-01.hpc.cluster  {8,9,27,28,29}
[0] MPI startup(): 4       12618    qingclinf-01.hpc.cluster  {10,11,12,30,31}
[0] MPI startup(): 5       12619    qingclinf-01.hpc.cluster  {13,14,32,33,34}
[0] MPI startup(): 6       12620    qingclinf-01.hpc.cluster  {15,16,17,35,36}
[0] MPI startup(): 7       12621    qingclinf-01.hpc.cluster  {18,19,37,38,39}
[0] MPI startup(): 8       12441    qingclinf-02.hpc.cluster  {0,1,2,20,21}
[0] MPI startup(): 9       12442    qingclinf-02.hpc.cluster  {3,4,22,23,24}
[0] MPI startup(): 10      12443    qingclinf-02.hpc.cluster  {5,6,7,25,26}
[0] MPI startup(): 11      12444    qingclinf-02.hpc.cluster  {8,9,27,28,29}
[0] MPI startup(): 12      12445    qingclinf-02.hpc.cluster  {10,11,12,30,31}
[0] MPI startup(): 13      12446    qingclinf-02.hpc.cluster  {13,14,32,33,34}
[0] MPI startup(): 14      12447    qingclinf-02.hpc.cluster  {15,16,17,35,36}
[0] MPI startup(): 15      12448    qingclinf-02.hpc.cluster  {18,19,37,38,39}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tmi
[0] MPI startup(): I_MPI_FALLBACK=1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,21,21,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=qib0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=8:0 0,1 3,2 5,3 8,4 10,5 13,6 15,7 18
[0] MPI startup(): I_MPI_PLATFORM=auto
[0] MPI startup(): I_MPI_TMI_PROVIDER=psm
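
If I read this debug output correctly, every rank reports "shm and tmi data transfer modes", so TMI/PSM was selected and no fallback message appears. One oddity: I_MPI_PLATFORM is reported as auto even though I exported bdw, so either that variable is not reaching the MPI processes or this version does not accept the bdw value. A sketch of pushing the variables onto every rank explicitly (the solver binary name here is just a placeholder):

# Force the fabric settings onto all ranks via mpirun, bypassing any
# environment-propagation issues (sketch; ./solver is illustrative):
mpirun -genv I_MPI_FABRICS shm:tmi \
       -genv I_MPI_TMI_PROVIDER psm \
       -genv I_MPI_FALLBACK 0 \
       -n 16 ./solver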

 

But there is no traffic over InfiniBand. This is the ifconfig output for the ib0 interface:

        inet 10.0.2.1  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::211:7500:6e:de10  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 121  bytes 23835 (23.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 118  bytes 22643 (22.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
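
As far as I know, ib0 is the IPoIB interface, and its RX/TX counters only count IP traffic going through the kernel stack; PSM traffic goes from user space through /dev/ipath straight to the adapter and never shows up in ifconfig. So flat ib0 counters do not by themselves prove MPI is avoiding InfiniBand. A sketch of checking the HCA port counters instead (assuming the adapter is qib0, as the I_MPI_INFO_NUMA_NODE_MAP=qib0:0 line above suggests):

# HCA port counters; the *_data values are in units of 4 bytes.
cat /sys/class/infiniband/qib0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/qib0/ports/1/counters/port_rcv_data
# or, with infiniband-diags installed, the extended counters:
perfquery -x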

 

I did chmod 666 on /dev/ipath and /dev/infiniband* on the compute nodes.
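
Since chmod on device nodes is lost at reboot, a udev rule is a more persistent way to do this. A sketch (the rule file name is arbitrary):

# /etc/udev/rules.d/90-ipath.rules (hypothetical file name)
KERNEL=="ipath*", MODE="0666"
KERNEL=="uverbs*", MODE="0666"

followed by udevadm control --reload-rules (or a reboot) on each compute node.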

The /etc/tmi.conf file contains the PSM library entry.
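
For reference, a PSM entry in Intel MPI's tmi.conf typically looks like the line below (illustrative; the version number and library path vary by installation):

# provider  API-version  library  arguments
psm 1.2 libtmip_psm.so " "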

Why do the simulations run fine but not use InfiniBand? I can ping and SSH over InfiniBand, but MPI cannot use it.
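
One way I know to confirm which transport the ranks actually use is a two-node latency test with the IMB benchmark that ships with Intel MPI (sketch; host names as above):

mpirun -n 2 -ppn 1 -hosts qingclinf-01,qingclinf-02 IMB-MPI1 PingPong

A small-message latency of a few microseconds would indicate PSM over the fabric; tens of microseconds or more would suggest TCP over IPoIB or Ethernet.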

Thanks in advance.

