Channel: Clusters and HPC Technology

Intel MPI issue with the usage of Slurm


To whom it may concern,

Hello. We use Slurm to manage our cluster, and we have run into a new problem with Intel MPI under Slurm. After a node reboots, Intel MPI jobs fail on that node, but manually restarting the Slurm daemon fixes it. I also tried adding "service slurm restart" to /etc/rc.local, which runs at the end of booting, but the issue persists.

I also reported this issue to slurm-dev, but they believed it was caused by the InfiniBand + Intel MPI configuration. They suggested that I configure dat.conf and set some Intel MPI variables. However, I don't know how to set them.

Here is an example:

$ salloc -N1 -n12 -w cn117 #cn117 is the node just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
  1) modules                    2) null                       3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ export I_MPI_FABRICS=shm:ofa
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0
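
To narrow this down, a couple of checks could be run inside the same allocation on the freshly rebooted node. This is only a sketch: I_MPI_DEBUG and I_MPI_FALLBACK are standard Intel MPI variables, but the idea that the locked-memory limit seen by tasks differs before and after the slurmd restart is an assumption, not something confirmed here.

```shell
# Sketch only: run inside the salloc session on the rebooted node.
export I_MPI_DEBUG=5          # print which fabric/provider Intel MPI selects
export I_MPI_FALLBACK=enable  # let startup fall back to another fabric instead of aborting

if command -v srun >/dev/null 2>&1; then
    # Assumption to test: tasks launched by the boot-time slurmd may see a
    # different max locked memory limit than tasks launched after a restart.
    srun -n1 bash -c 'echo "memlock limit seen by task: $(ulimit -l)"'
else
    echo "srun not found; run this inside the Slurm allocation"
fi
```

Comparing the printed limit before and after "service slurm restart" would show whether the two slurmd instances hand different limits to their tasks.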

After restarting the slurm daemon:

$ ssh root@cn117
cn117$  service slurm restart
stopping slurmd:                                           [  OK  ]
slurmd is stopped
starting slurmd:                                           [  OK  ]
$ exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ export I_MPI_FABRICS=shm:ofa
$ srun ./hello
This is Process  9 out of 12 running on host cn117
This is Process  3 out of 12 running on host cn117
This is Process  2 out of 12 running on host cn117
This is Process  7 out of 12 running on host cn117
This is Process  6 out of 12 running on host cn117
This is Process  0 out of 12 running on host cn117
This is Process  5 out of 12 running on host cn117
This is Process  1 out of 12 running on host cn117
This is Process  4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process  8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117

Here is the default dat.conf we have:

# DAT v2.0, v1.2 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
#           <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provder, <ia_params> is one of the following:
#       network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL RoCE provider, <ia_params> is device name and 0
#
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-2 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-mic0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mic0:ib 1" ""
ofa-v2-mlx4_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx5_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 2" ""
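
For reference, the kind of setting slurm-dev seemed to suggest might look like the sketch below, which switches from the ofa fabric to dapl and names one provider from the listing above. I_MPI_FABRICS and I_MPI_DAPL_PROVIDER are standard Intel MPI variables. The choice of ofa-v2-mlx4_0-1, the /etc/dat.conf path, and the DAT_CONF helper variable are assumptions for illustration and untested on this cluster.

```shell
# Sketch only: values are assumptions, not a verified fix.
export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
export I_MPI_FABRICS=shm:dapl                # shared memory within a node, DAPL between nodes
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1   # dat.conf entry for device mlx4_0, port 1

# DAT_CONF is a hypothetical helper variable; the real file location may differ.
DAT_CONF=${DAT_CONF:-/etc/dat.conf}
if [ -r "$DAT_CONF" ]; then
    # Confirm the chosen provider name really exists in dat.conf.
    grep "^${I_MPI_DAPL_PROVIDER} " "$DAT_CONF" || echo "provider not found in $DAT_CONF"
fi

if command -v srun >/dev/null 2>&1; then
    srun ./hello
fi
```

The provider name must match the first field of a dat.conf line exactly, and the device and port in that line should correspond to an active port on the node.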

Some system information here:

$ slurmd -V
slurm 14.03.0

$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 4.1 Update 1 Build 20130522
Copyright (C) 2003-2013, Intel Corporation. All rights reserved.

cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):

cn117$ ibv_devinfo
hca_id: mlx4_0
transport:   InfiniBand (0)
fw_ver:    2.11.550
node_guid:
sys_image_guid:   ##########
vendor_id:   ##########
vendor_part_id:   ########
hw_ver:    0x0
board_id:   ########
phys_port_cnt:   2
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  4096 (5)
   sm_lid:   1
   port_lid:  131
   port_lmc:  0x00
   link_layer:  InfiniBand

  port: 2
   state:   PORT_DOWN (1)
   max_mtu:  4096 (5)
   active_mtu:  4096 (5)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  InfiniBand


cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)
cn117$ uname -r
2.6.32-431.23.3.el6.x86_64

I wonder whether anyone has faced a similar issue before and could help us figure out a solution.

Thanks,

Tingyang Xu

