Hi All -
I'm getting an error and am not quite sure where to begin tracking it down. I'm running a model that is known to run on our system, using:
Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024 (id: 082ae5608)
The code runs for certain core counts (generally the smaller ones) but fails at other counts with:
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f01f66321d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f01f5dba031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7f01f5f34c5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7f01f5e465e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7f01f5e1ad8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7f01f5f376a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 2: Internal error
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f983aed31d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f983a65b031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7f983a7d5c5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7f983a6e75e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7f983a6bbd8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7f983a7d86a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 3: Internal error
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f9ede95a1d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f9ede0e2031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7f9ede25cc5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7f9ede16e5e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7f9ede142d8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7f9ede25f6a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 1: Internal error
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7ff6ff7ae1d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7ff6fef36031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7ff6ff0b0c5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7ff6fefc25e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7ff6fef96d8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7ff6ff0b36a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 84: Internal error
The code does run correctly with this core configuration when started through SLURM using "srun --mpi=pmi2". Can you provide any guidance?
The machine this is running on is a dual-socket AMD EPYC 7702 system with hyperthreading disabled, running Ubuntu 18.04 Server.
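Since the failing assertion is on node_info->numa_num, the number of NUMA nodes each compute node exposes may be relevant. Here is a minimal sketch for reporting that count, assuming the usual Linux sysfs layout under /sys/devices/system/node. The function name count_numa_nodes is only illustrative, and I don't know what limit the library actually enforces, so this just prints the count for comparison across nodes.

```python
import os
import re


def count_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Count NUMA nodes by looking for node<N> directories under sysfs_root.

    Assumes the standard Linux sysfs layout; returns 0 if the path is absent.
    """
    if not os.path.isdir(sysfs_root):
        return 0  # no NUMA information exposed at this path
    pattern = re.compile(r"^node\d+$")
    return sum(
        1
        for name in os.listdir(sysfs_root)
        if pattern.match(name) and os.path.isdir(os.path.join(sysfs_root, name))
    )


if __name__ == "__main__":
    print(f"NUMA nodes visible on this host: {count_numa_nodes()}")
```

Running this on each compute node (for example inside the same job allocation) would show whether the NUMA node count differs between the hosts involved in the failing and working runs.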
Thanks