I'm developing an MPI application that relies heavily on MPI shared memory. Recently, I keep hitting bus errors. Here is the output from two separate job steps:
srun: error: compute-42-013: task 32: Bus error
srun: Terminating job step 324080.0
slurmstepd: error: *** STEP 324080.0 ON compute-42-012 CANCELLED AT 2020-06-14T04:17:51 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
pVelodyne_intel_4 000000000C8E308E Unknown Unknown Unknown
libpthread-2.17.s 00002B370FBBA5D0 Unknown Unknown Unknown
pVelodyne_intel_4 000000000397721B PMPIDI_CH3I_Progr 1040 ch3_progress.c
pVelodyne_intel_4 00000000039FC370 MPIC_Wait 269 helper_fns.c
pVelodyne_intel_4 00000000039FD83A MPIC_Sendrecv 580 helper_fns.c
pVelodyne_intel_4 000000000392F61B MPIR_Allgather_in 257 allgather.c
pVelodyne_intel_4 0000000003931752 MPIR_Allgather 858 allgather.c
pVelodyne_intel_4 0000000003931A77 MPIR_Allgather_im 905 allgather.c
pVelodyne_intel_4 0000000003933226 PMPI_Allgather 1068 allgather.c
pVelodyne_intel_4 000000000392CECE Unknown Unknown Unknown
srun: error: compute-41-006: task 16: Bus error
srun: Terminating job step 324024.0
slurmstepd: error: *** STEP 324024.0 ON compute-41-006 CANCELLED AT 2020-06-13T16:54:13 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
pVelodyne_intel_4 000000000C85058E Unknown Unknown Unknown
libpthread-2.17.s 00002AEBC007A5D0 Unknown Unknown Unknown
pVelodyne_intel_4 00000000038E46DB PMPIDI_CH3I_Progr 1040 ch3_progress.c
pVelodyne_intel_4 0000000003969830 MPIC_Wait 269 helper_fns.c
pVelodyne_intel_4 000000000396ACFA MPIC_Sendrecv 580 helper_fns.c
pVelodyne_intel_4 00000000038BA379 MPIR_Alltoall_int 438 alltoall.c
pVelodyne_intel_4 00000000038BBE3D MPIR_Alltoall 734 alltoall.c
pVelodyne_intel_4 00000000038BC162 MPIR_Alltoall_imp 775 alltoall.c
pVelodyne_intel_4 00000000038BD875 PMPI_Alltoall 958 alltoall.c
The bus error seems to occur inside MPI library routines: the traces end in PMPI_Allgather and PMPI_Alltoall. Since I don't have the Intel MPI source code, I have no idea what went wrong.
The Intel MPI version I'm using is the one from intel_parallel_studio/2018u4/compilers_and_libraries_2018.5.274.
Any idea how to fix it?
Thanks.