I'm working at a site configured with IMPI (2016.4.072) and Slurm (17.11.4). MpiDefault is set to none.
When I run my MPICH2 code without an --mpi option (so it defaults to --mpi=none):
srun -N 2 -n 4 -l -vv ...
I get the following (duplicate error messages from other ranks trimmed):
0: PMII_singinit: execv failed: No such file or directory
0: [unset]: This singleton init program attempted to access some feature
0: [unset]: for which process manager support was required, e.g. spawn or universe_size.
0: [unset]: But the necessary mpiexec is not in your path.
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18014_0 key=P2-hostname
0: :
0: system msg for write_line failure : Bad file descriptor
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18014_0 key=P3-hostname
0: :
0: system msg for write_line failure : Bad file descriptor
0: 2018-05-25 09:00:14 2: MPI startup(): Multi-threaded optimized library
0: 2018-05-25 09:00:14 2: DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
0: 2018-05-25 09:00:14 2: MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
0: 2018-05-25 09:00:14 2: MPI startup(): shm and dapl data transfer modes
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18417_0 key=P1-businesscard-0
0: :
0: system msg for write_line failure : Bad file descriptor
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=foobar key=foobar
0: :
0: system msg for write_line failure : Bad file descriptor
0: [unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_18417_0 key=P1-businesscard-0
0: :
0: system msg for write_line failure : Bad file descriptor
0: Fatal error in PMPI_Init_thread: Other MPI error, error stack:
0: MPIR_Init_thread(784).................:
0: MPID_Init(1332).......................: channel initialization failed
0: MPIDI_CH3_Init(141)...................:
0: dapl_rc_setup_all_connections_20(1388): generic failure with errno = 872614415
0: getConnInfoKVS(849)...................: PMI_KVS_Get failed
If I run the same code with:
srun --mpi=pmi2 ...
it works fine.
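For context, here is a minimal sketch of how the plugin setup behind these two runs could be inspected. It assumes the standard Slurm client tools (srun, scontrol) are on PATH; the function name show_slurm_mpi_config is only an illustrative helper, not part of Slurm.

```shell
#!/bin/sh
# Sketch: report Slurm's MPI plugin configuration.
# show_slurm_mpi_config is a hypothetical helper name chosen for this example.
show_slurm_mpi_config() {
    if ! command -v srun >/dev/null 2>&1; then
        echo "srun not found in PATH"
        return 1
    fi
    # List the MPI plugin types this Slurm installation offers (e.g. none, pmi2).
    srun --mpi=list
    # Show the site-wide default that applies when --mpi is not given.
    scontrol show config | grep -i '^MpiDefault'
}

show_slurm_mpi_config || true
```

If pmi2 appears in the list while MpiDefault is none, that matches the behaviour described above: the pmi2 plugin is only used when it is requested explicitly.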
A couple of questions/comments:
1. In neither case do I set I_MPI_PMI_LIBRARY, which I thought was required. How else would IMPI find the Slurm PMI library? Not setting it might be why --mpi=none fails, but for the moment I can't set the variable because I can't find libpmi[1,2,x].so.
2. I would think that since none is the default, it should work. Under what conditions would none fail, but pmi2 work? Is it because IMPI supports pmi2?
3. If I do need to set I_MPI_PMI_LIBRARY, why does pmi2 still work without setting I_MPI_PMI_LIBRARY? Or do I not need to set it when using IMPI?
4. I'm still trying to understand the relationship between libpmi.so and mpi_*.so. libpmi.so is the Slurm PMI library, correct? And the mpi_* files are the Slurm plug-in libraries (e.g. mpi_none, mpi_pmi2, etc.). How do these libraries fit together?
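Regarding question 1, a minimal sketch of the kind of search for the PMI library, assuming a find that supports -maxdepth. The directory list is a guess (sites install Slurm under different prefixes), and find_pmi_libs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: look for Slurm PMI libraries in a set of candidate directories.
# find_pmi_libs is a hypothetical helper; the directories passed below are
# assumptions and should be replaced with the site's actual Slurm prefix.
find_pmi_libs() {
    for dir in "$@"; do
        # Skip candidates that do not exist on this system.
        [ -d "$dir" ] || continue
        find "$dir" -maxdepth 2 -name 'libpmi*.so*' 2>/dev/null
    done
}

find_pmi_libs /usr/lib64 /usr/lib /usr/local/lib /opt/slurm/lib

# If a library turns up, IMPI could be pointed at it before calling srun:
# export I_MPI_PMI_LIBRARY=/path/to/libpmi.so
```

The export line is left commented out because the actual path is unknown here.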
Thanks,
Raymond