
IMPI w/ Slurm


I'm working at a site configured with Intel MPI (IMPI 2016.4.072) and Slurm (17.11.4).  MpiDefault in slurm.conf is set to none.
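(This can be confirmed with a standard scontrol query:

     scontrol show config | grep MpiDefault

which shows MpiDefault = none here.)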

When I run my MPICH2-based code (so srun defaults to --mpi=none)

     srun -N 2 -n 4 -l -vv ...

I get the following (trimming out duplicate error messages from other ranks):

0: PMII_singinit: execv failed: No such file or directory
0: [unset]:   This singleton init program attempted to access some feature
0: [unset]:   for which process manager support was required, e.g. spawn or universe_size.
0: [unset]:   But the necessary mpiexec is not in your path.
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18014_0 key=P2-hostname
0: :
0: system msg for write_line failure : Bad file descriptor
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18014_0 key=P3-hostname
0: :
0: system msg for write_line failure : Bad file descriptor
0: 2018-05-25 09:00:14  2: MPI startup(): Multi-threaded optimized library
0: 2018-05-25 09:00:14  2: DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
0: 2018-05-25 09:00:14  2: MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
0: 2018-05-25 09:00:14  2: MPI startup(): shm and dapl data transfer modes
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18417_0 key=P1-businesscard-0
0: :
0: system msg for write_line failure : Bad file descriptor
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=foobar key=foobar
0: :
0: system msg for write_line failure : Bad file descriptor
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18417_0 key=P1-businesscard-0
0: :
0: system msg for write_line failure : Bad file descriptor
0: Fatal error in PMPI_Init_thread: Other MPI error, error stack:
0: MPIR_Init_thread(784).................:
0: MPID_Init(1332).......................: channel initialization failed
0: MPIDI_CH3_Init(141)...................:
0: dapl_rc_setup_all_connections_20(1388): generic failure with errno = 872614415
0: getConnInfoKVS(849)...................: PMI_KVS_Get failed

 

If I run the same code with

 

   srun --mpi=pmi2 ...

 

it works fine.
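As a sanity check, srun can list the MPI plugins it knows about (a standard srun option):

     srun --mpi=list

Presumably that list includes at least none and pmi2 here, since srun accepts both.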

 

A few questions/comments:

1. In neither case do I set I_MPI_PMI_LIBRARY, which I thought was required -- how else does IMPI find the Slurm PMI library?  This might be why --mpi=none is failing, but for the moment I can't set the variable because I can't find libpmi[1,2,x].so anywhere (see the search sketch after this list).

2. I would think that since none is the default, it should work.  Under what conditions would none fail, but pmi2 work?  Is it because IMPI supports pmi2?

3. If I do need to set I_MPI_PMI_LIBRARY, why does --mpi=pmi2 work without it?  Or is the variable unnecessary when using IMPI?

4. I'm still trying to understand the relationship between libpmi.so and the mpi_*.so plugins.  libpmi.so is the Slurm PMI library, correct?  And the mpi_* files are the Slurm MPI plugins (e.g. mpi_none, mpi_pmi2, etc.)?  How do these pieces fit together?  (The sketch after this list shows where I've been looking for both.)
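For concreteness, here is roughly how I've been hunting for these pieces; the search paths are typical install defaults rather than confirmed locations at this site, and ./my_mpi_app is a placeholder:

     # Look for the Slurm PMI client libraries (question 1); install paths vary by site:
     find /usr/lib64 /usr/local/lib /opt/slurm -name 'libpmi*.so*' 2>/dev/null
     # The Slurm MPI plugins (question 4) live under PluginDir from slurm.conf:
     scontrol show config | grep -i PluginDir
     ls /usr/lib64/slurm/mpi_*.so     # e.g. mpi_none.so, mpi_pmi2.so
     # If libpmi2.so does turn up, my understanding is it would be wired in as:
     export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
     srun --mpi=pmi2 -N 2 -n 4 ./my_mpi_app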

 

Thanks,

Raymond

