To whom it may concern,
Hello. We use Slurm to manage our cluster, and we have run into a new issue with Intel MPI under Slurm. When a node reboots, Intel MPI jobs fail on that node until the Slurm daemon is restarted manually. I also tried adding "service slurm restart" to /etc/rc.local, which runs at the end of boot, but the issue is still there.
I reported this issue to slurm-dev, but they believed it was caused by the InfiniBand + Intel MPI configuration. They suggested that I configure dat.conf and set some Intel MPI variables. However, I don't know how to set them.
Here is an example:
$ salloc -N1 -n12 -w cn117   # cn117 is the node just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
  1) modules   2) null   3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ export I_MPI_FABRICS=shm:ofa
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0
After restarting the slurm daemon:
$ ssh root@cn117
cn117$ service slurm restart
stopping slurmd: [ OK ]
slurmd is stopped
starting slurmd: [ OK ]
$ exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ export I_MPI_FABRICS=shm:ofa
$ srun ./hello
This is Process 9 out of 12 running on host cn117
This is Process 3 out of 12 running on host cn117
This is Process 2 out of 12 running on host cn117
This is Process 7 out of 12 running on host cn117
This is Process 6 out of 12 running on host cn117
This is Process 0 out of 12 running on host cn117
This is Process 5 out of 12 running on host cn117
This is Process 1 out of 12 running on host cn117
This is Process 4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process 8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117
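Since a manual restart changes the behaviour, one thing that may be worth comparing is the resource limits of the slurmd started at boot against the one started by hand. RDMA fabrics generally need to pin memory, so the locked-memory limit is a plausible place to look, but this is only a diagnostic idea and not a confirmed cause. The sketch below assumes a Linux /proc filesystem; the function name show_limit is made up for illustration and is not part of Slurm or Intel MPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or Intel MPI): print one row of
# /proc/<pid>/limits so the same limit can be compared before and after
# restarting a daemon.
# Usage: show_limit <pid> [limit-name]
show_limit() {
    pid="$1"
    # Default to the locked-memory row; any row name from the table works.
    name="${2:-Max locked memory}"
    # /proc/<pid>/limits is a Linux-specific table: name, soft, hard, units.
    grep -i "^$name" "/proc/$pid/limits"
}
```

On the affected node this could be run as show_limit "$(pgrep -o slurmd)" once right after boot and once after "service slurm restart". A difference between the two outputs would be a lead to follow; identical outputs would rule this idea out.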
Here is the default dat.conf we have:
# DAT v2.0, v1.2 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
#           <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provder, <ia_params> is one of the following:
#       network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL RoCE provider, <ia_params> is device name and 0
#
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-2 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-mic0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mic0:ib 1" ""
ofa-v2-mlx4_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx5_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 2" ""
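Most of these entries refer to devices that are not present on the node. To narrow the file down to the entries that could apply, a small filter along these lines may help. It is a sketch only: the function name dat_providers is hypothetical, and it simply matches the quoted <ia_params> text, nothing more.

```shell
#!/bin/sh
# Hypothetical helper: list the <ia_name> of every entry in a dat.conf-style
# file whose <ia_params> field names the given device and port.
# Usage: dat_providers <file> <device> <port>
dat_providers() {
    # Build the quoted ia_params string to search for, e.g. "mlx4_0 1".
    awk -v want="\"$2 $3\"" '
        /^[[:space:]]*#/ { next }       # skip comment lines
        NF == 0          { next }       # skip blank lines
        index($0, want)  { print $1 }   # print the name of matching entries
    ' "$1"
}
```

For the node in question this might be called as dat_providers /etc/dat.conf mlx4_0 1, where the path /etc/dat.conf is an assumption about where the file lives, and mlx4_0 port 1 is the active port reported by ibv_devinfo further down. The names it prints would be the candidates for Intel MPI's I_MPI_DAPL_PROVIDER variable if the dapl fabric were tried in place of ofa; whether that is the right direction is still an open question.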
Some system information here:
$ slurmd -V
slurm 14.03.0
$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 4.1 Update 1 Build 20130522
Copyright (C) 2003-2013, Intel Corporation. All rights reserved.
cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):
cn117$ ibv_devinfo
hca_id: mlx4_0
        transport:          InfiniBand (0)
        fw_ver:             2.11.550
        node_guid:
        sys_image_guid:     ##########
        vendor_id:          ##########
        vendor_part_id:     ########
        hw_ver:             0x0
        board_id:           ########
        phys_port_cnt:      2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         1
                        port_lid:       131
                        port_lmc:       0x00
                        link_layer:     InfiniBand
                port:   2
                        state:          PORT_DOWN (1)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     InfiniBand
cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)
cn117$ uname -r
2.6.32-431.23.3.el6.x86_64
I wonder whether anyone has faced a similar issue before and could help us figure out a solution.
Thanks,
Tingyang Xu