Channel: Clusters and HPC Technology

INTERNAL ERROR with SLURM and PMI2

I was pleasantly surprised to read that the 2017 release of Intel MPI supports PMI2 with SLURM. I tested it, but it fails immediately on my setup. I'm using Intel Parallel Studio 2017 Update 4 and SLURM 15.08.13. Even a simple MPI program doesn't work:

[donners@int1 pmi2]$ cat mpi.f90
program test
  use mpi
  implicit none

  integer ierr,nprocs,rank

  call mpi_init(ierr)
  call mpi_comm_size(MPI_COMM_WORLD,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,rank,ierr)
  if (rank .eq. 0) then
    print *,'Number of processes: ',nprocs
  endif
  print*,'I am rank ',rank
  call mpi_finalize(ierr)

end
[donners@int1 pmi2]$ mpiifort mpi.f90
[donners@int1 pmi2]$ ldd ./a.out
    linux-vdso.so.1 =>  (0x00007ffcc0364000)
    libmpifort.so.12 => /opt/intel/parallel_studio_xe_2017_update4/compilers_and_libraries/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002ad7432a9000)
    libmpi.so.12 => /opt/intel/parallel_studio_xe_2017_update4/compilers_and_libraries/linux/mpi/intel64/lib/release_mt/libmpi.so.12 (0x00002ad743652000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002ad744397000)
    librt.so.1 => /lib64/librt.so.1 (0x00002ad74459c000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ad7447a4000)
    libm.so.6 => /lib64/libm.so.6 (0x00002ad7449c1000)
    libc.so.6 => /lib64/libc.so.6 (0x00002ad744c46000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ad744fda000)
    /lib64/ld-linux-x86-64.so.2 (0x00002ad743086000)
[donners@int1 pmi2]$ I_MPI_PMI2=yes srun -n 1 --mpi=pmi2 ./a.out

INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPID_Init:2104
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1716)......: channel initialization failed
MPID_Init(2104)......: fail failed
srun: error: tcn1467: task 0: Exited with exit code 15
srun: Terminating job step 3270641.0

[donners@int1 pmi2]$ srun --version
slurm 15.08.13-Bull.1.0

The same problem occurs on a system with SLURM 17.02.3 (at TACC). What might be the problem here?

With regards,

John

 

