Channel: Clusters and HPC Technology

memory error occurs only with a certain job launching method and shm:dapl


Hello,

I'm experimenting with different job launching methods (http://slurm.schedmd.com/mpi_guide.html#intel_mpi), and I get the following error only when I launch a job with srun (my code works fine with mpirun, mpirun -bootstrap slurm, and mpiexec.hydra) AND with shm:dapl (it works fine with shm:tcp).

If I launch the job with

setenv I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so
setenv I_MPI_FABRICS shm:dapl
srun -n 2 my_exec

I get

1: [1] trying to free memory block that is currently involved to uncompleted data transfer operation
1:  free mem  - addr=0x2b7a44547f70 len=1146388320
1:  RTC entry - addr=0x2b7a4bc93a00 len=1254064 cnt=1
1: Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
1: internal ABORT - process 1
0: [0] trying to free memory block that is currently involved to uncompleted data transfer operation
0:  free mem  - addr=0x2ab3a253ff90 len=2723413888
0:  RTC entry - addr=0x2ab3a7aada80 len=1182864 cnt=1
0: Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
0: internal ABORT - process 0

This error disappears if I set I_MPI_FABRICS to shm:tcp.
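For reference, a minimal sketch of the two workarounds to compare (csh syntax, as above). The second variant is an assumption on my part: the "RTC entry" lines in the abort message suggest the DAPL registration/translation cache, and I_MPI_DAPL_TRANSLATION_CACHE is the Intel MPI knob that disables it. If the abort disappears with the cache off, that would point to a cache/configuration issue rather than a bug in the application.

```shell
# Variant 1: same srun launch, but over the TCP fabric -- this works.
setenv I_MPI_PMI_LIBRARY /usr/lib64/libpmi.so
setenv I_MPI_FABRICS shm:tcp
srun -n 2 my_exec

# Variant 2 (untested suggestion): keep shm:dapl but disable the DAPL
# translation (registration) cache to see whether the cache is at fault.
setenv I_MPI_FABRICS shm:dapl
setenv I_MPI_DAPL_TRANSLATION_CACHE 0
srun -n 2 my_exec
```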

So what is the difference between srun and the other launching methods in this regard? I want to know whether this can happen because of a bug in my code (which I would then need to fix), or whether it is just a configuration issue, in which case simply avoiding srun would be sufficient.
