Hello,
I'm trying out different job launch methods (http://slurm.schedmd.com/mpi_guide.html#intel_mpi), and I get the following error only when I launch a job using srun AND use shm:dapl (it works fine with shm:tcp). My code works fine with mpirun, mpirun --bootstrap=slurm, and mpiexec.hydra.
If I launch the job with
setenv I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so
setenv I_MPI_FABRICS shm:dapl
srun -n 2 my_exec
I get
1: [1] trying to free memory block that is currently involved to uncompleted data transfer operation
1: free mem - addr=0x2b7a44547f70 len=1146388320
1: RTC entry - addr=0x2b7a4bc93a00 len=1254064 cnt=1
1: Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
1: internal ABORT - process 1
0: [0] trying to free memory block that is currently involved to uncompleted data transfer operation
0: free mem - addr=0x2ab3a253ff90 len=2723413888
0: RTC entry - addr=0x2ab3a7aada80 len=1182864 cnt=1
0: Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
0: internal ABORT - process 0
The error disappears if I set I_MPI_FABRICS to shm:tcp.
So what is the difference between srun and the other launch methods in this regard? I want to know whether this can be caused by a bug in my code (which I would then need to fix), or whether it is just a configuration issue, in which case simply not using srun would be sufficient.
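For anyone who wants to reproduce the comparison, here is a minimal sketch of the two srun runs as a POSIX sh script (the commands above are csh). The library path, process count, and my_exec are taken from the commands above; the DRY_RUN variable and the run_with_fabric helper are hypothetical names introduced only for this sketch. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: launch the same executable with srun under each fabric setting.
# DRY_RUN and run_with_fabric are illustrative names, not part of Slurm or Intel MPI.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run_with_fabric() {
    fabric="$1"
    if [ "$DRY_RUN" = "1" ]; then
        # Print the command line instead of executing it.
        echo "I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so I_MPI_FABRICS=$fabric srun -n 2 my_exec"
    else
        # Set both variables for this one srun invocation only.
        I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so I_MPI_FABRICS="$fabric" srun -n 2 my_exec
    fi
}

# shm:dapl is the failing case, shm:tcp the working one.
for fabric in shm:dapl shm:tcp; do
    run_with_fabric "$fabric"
done
```

Running it with DRY_RUN=0 inside an allocation should execute both launches back to back, so the only thing that changes between the two runs is I_MPI_FABRICS.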