Channel: Clusters and HPC Technology

IMPI numa_num Assertion Failed


Hi All - 

I'm getting an error and not quite sure where to begin tracking it down. I'm running a model known to run on our system using:

Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024 (id: 082ae5608)

The code runs at certain core counts (generally the smaller ones) but fails at others with:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f01f66321d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f01f5dba031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7f01f5f34c5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7f01f5e465e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7f01f5e1ad8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7f01f5f376a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 2: Internal error
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f983aed31d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f983a65b031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7f983a7d5c5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7f983a6e75e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7f983a6bbd8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7f983a7d86a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 3: Internal error
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7f9ede95a1d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f9ede0e2031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7f9ede25cc5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7f9ede16e5e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7f9ede142d8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7f9ede25f6a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 1: Internal error
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2101: node_info->numa_num <= ((MPIDI_SHMGR_SYNCPAGE_SIZE / MPIDI_SHMGR_FLAG_SPACE) - 1)
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x34) [0x7ff6ff7ae1d4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7ff6fef36031]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x285c5d) [0x7ff6ff0b0c5d]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x1975e4) [0x7ff6fefc25e4]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x16bd8e) [0x7ff6fef96d8e]
/apps/applications/development/compilers/intel/19.1/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib/release/libmpi.so.12(+0x2886a7) [0x7ff6ff0b36a7]
/apps/applications/development/compilers/intel/1
Abort(1) on node 84: Internal error

The code does run correctly with this core configuration when started through SLURM using "srun --mpi=pmi2". Can you provide any guidance? 

The machine this is running on is a dual-socket AMD EPYC 7702 system with hyperthreading disabled, running Ubuntu 18.04 Server.
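
Since the failing assertion places an upper bound on node_info->numa_num, one place to start looking might be the number of NUMA nodes each compute node exposes. The sketch below is only an illustration under assumptions: it takes numa_num to be the per-node NUMA domain count (inferred from the field name, not confirmed), it reads the standard Linux sysfs location /sys/devices/system/node, and list_numa_nodes is a hypothetical helper name. The values of MPIDI_SHMGR_SYNCPAGE_SIZE and MPIDI_SHMGR_FLAG_SPACE are not known here, so the script reports the count without comparing it to any limit.

```python
import os
import re

# Directories named node0, node1, ... under sysfs represent NUMA nodes on Linux.
NODE_DIR_PATTERN = re.compile(r"^node(\d+)$")


def list_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found under sysfs_root.

    Hypothetical helper for illustration; returns an empty list when the
    sysfs directory does not exist (for example on a non-Linux system).
    """
    if not os.path.isdir(sysfs_root):
        return []
    node_ids = []
    for name in os.listdir(sysfs_root):
        match = NODE_DIR_PATTERN.match(name)
        # Skip plain files such as "online" or "possible" in the same directory.
        if match and os.path.isdir(os.path.join(sysfs_root, name)):
            node_ids.append(int(match.group(1)))
    return sorted(node_ids)


if __name__ == "__main__":
    nodes = list_numa_nodes()
    print(f"{len(nodes)} NUMA node(s) visible: {nodes}")
```

Running this on a compute node shows how many NUMA domains the operating system reports there, which can then be weighed against the core counts that do and do not trigger the assertion.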

Thanks

