I was pleasantly surprised to read that PMI2 with SLURM is supported by Intel MPI in the 2017 release. I tested it, but it fails immediately on my setup. I'm using Intel Parallel Studio 2017 Update 4 and SLURM 15.08.13. A simple MPI program doesn't work:
[donners@int1 pmi2]$ cat mpi.f90
program test
  use mpi
  implicit none
  integer :: ierr, nprocs, rank

  call mpi_init(ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
    print *, 'Number of processes: ', nprocs
  endif
  print *, 'I am rank ', rank
  call mpi_finalize(ierr)
end program test
[donners@int1 pmi2]$ mpiifort mpi.f90
[donners@int1 pmi2]$ ldd ./a.out
linux-vdso.so.1 => (0x00007ffcc0364000)
libmpifort.so.12 => /opt/intel/parallel_studio_xe_2017_update4/compilers_and_libraries/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002ad7432a9000)
libmpi.so.12 => /opt/intel/parallel_studio_xe_2017_update4/compilers_and_libraries/linux/mpi/intel64/lib/release_mt/libmpi.so.12 (0x00002ad743652000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ad744397000)
librt.so.1 => /lib64/librt.so.1 (0x00002ad74459c000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ad7447a4000)
libm.so.6 => /lib64/libm.so.6 (0x00002ad7449c1000)
libc.so.6 => /lib64/libc.so.6 (0x00002ad744c46000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ad744fda000)
/lib64/ld-linux-x86-64.so.2 (0x00002ad743086000)
[donners@int1 pmi2]$ I_MPI_PMI2=yes srun -n 1 --mpi=pmi2 ./a.out
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPID_Init:2104
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1716)......: channel initialization failed
MPID_Init(2104)......: fail failed
srun: error: tcn1467: task 0: Exited with exit code 15
srun: Terminating job step 3270641.0
[donners@int1 pmi2]$ srun --version
slurm 15.08.13-Bull.1.0

The same problem occurs on a system with SLURM 17.02.3 (at TACC). What might be the problem here?
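For reference, this is roughly how I'm checking the PMI2 side of things. It's a sketch, assuming SLURM's standard `srun --mpi=list` option and Intel MPI's `I_MPI_PMI_LIBRARY` variable; the library path below is an assumption for my system, not a verified location:

```shell
# Show the MPI plugin types compiled into this SLURM build;
# "pmi2" must appear in the list for srun --mpi=pmi2 to work.
srun --mpi=list

# Intel MPI can also be pointed explicitly at SLURM's PMI2 library.
# NOTE: the path below is an assumption -- adjust to the local install.
export I_MPI_PMI_LIBRARY=/usr/lib64/slurm/libpmi2.so
srun -n 1 --mpi=pmi2 ./a.out
```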
With regards,
John