I installed Intel MPI 2019 on Linux and tested it with several MPI programs (compiled with gcc, g++, and gfortran from GCC 8.2) with no issues, using the following environment setup:
export I_MPI_DEBUG=5
export I_MPI_LIBRARY_KIND=debug
export I_MPI_OFI_LIBRARY_INTERNAL=1
. ~/intel/compilers_and_libraries_2019.0.117/linux/mpi/intel64/bin/mpivars.sh
I also created a symlink 'mpifort' pointing to mpifc (for compatibility with the mpich/Open MPI naming convention).
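For anyone reproducing this setup, the symlink step can be sketched as below. The scratch directory stands in for Intel MPI's intel64/bin directory (path as in the mpivars.sh line above); in practice you would run the ln command inside the real bin directory.

```shell
# Sketch: create the mpifort -> mpifc symlink.  A scratch dir stands in for
# .../compilers_and_libraries_2019.0.117/linux/mpi/intel64/bin here.
bin=$(mktemp -d)               # stand-in for Intel MPI's intel64/bin
touch "$bin/mpifc"             # stand-in for the real mpifc wrapper script
ln -s mpifc "$bin/mpifort"     # the same command works in the real bin dir
readlink "$bin/mpifort"        # -> mpifc
```

A relative link target (mpifc, not an absolute path) keeps the link valid if the install tree is moved.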
I've been trying to get OpenCoarrays-2.2.0 (opencoarrays.org) working with Intel MPI 2019 on Linux, for gfortran (GCC 8.2), as a coarray Fortran (CAF) development environment. Since OpenCoarrays is developed and tested against the mpich MPI implementation, I was optimistic, based on Intel MPI's mpich ABI compatibility, that Intel MPI would work too.
The install.sh script that can be used to build OpenCoarrays finds the expected ULFM routines (see fault-tolerance.org), and builds libcaf_mpi.so with -DUSE_FAILED_IMAGES defined:
-- Looking for signal.h - found
-- Looking for SIGKILL
-- Looking for SIGKILL - found
-- Looking for include files mpi.h, mpi-ext.h
-- Looking for include files mpi.h, mpi-ext.h - not found
-- Looking for MPIX_ERR_PROC_FAILED
-- Looking for MPIX_ERR_PROC_FAILED - found
-- Looking for MPIX_ERR_REVOKED
-- Looking for MPIX_ERR_REVOKED - found
-- Looking for MPIX_Comm_failure_ack
-- Looking for MPIX_Comm_failure_ack - found
-- Looking for MPIX_Comm_failure_get_acked
-- Looking for MPIX_Comm_failure_get_acked - found
-- Looking for MPIX_Comm_shrink
-- Looking for MPIX_Comm_shrink - found
-- Looking for MPIX_Comm_agree
-- Looking for MPIX_Comm_agree - found
However, when attempting to execute coarray Fortran code compiled with mpifc under mpirun, an assertion fails, as shown below. For this run, the mpi_caf.o and caf_auxiliary.o object files that make up libcaf_mpi.so were compiled with -g and linked to a coarray Fortran program also compiled with -g and -fcoarray=lib (plus the other relevant settings obtained from caf -show, mpicc -show, and mpifc -show). Note the line "Assertion failed in file ../../src/mpid/ch4/src/ch4_comm.h at line 89: 0".
[bmaggard@localhost oca]$ mpiexec.hydra -genv I_MPI_DEBUG=5 -gdb -n 1 ./a.out
mpigdb: attaching to 17651 ./a.out localhost.localdomain
[0] (mpigdb) start
[0] The program being debugged has been started already.
[0] Start it from the beginning? (y or n) [answered Y; input not from terminal]
[0] Temporary breakpoint 1 at 0x402737: file pi_caf.f90, line 1.
[0] Starting program: /home/bmaggard/oca/a.out
[bmaggard@localhost oca]$
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] [New Thread 0x7ffff42bf700 (LWP 17694)]
[0] [New Thread 0x7ffff3abe700 (LWP 17695)]
[0] Detaching after fork from child process 17696.
[0] Assertion failed in file ../../src/mpid/ch4/src/ch4_comm.h at line 89: 0
[0] /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0xbb298e) [0x7ffff6a9f98e]
[0] /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPL_backtrace_show+0x18) [0x7ffff6a9fafd]
[0] /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPIR_Assert_fail+0x5c) [0x7ffff6101e0b]
[0] /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0x2fc72c) [0x7ffff61e972c]
[0] /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(+0x2fc832) [0x7ffff61e9832]
[0] /home/bmaggard/intel//compilers_and_libraries_2019.0.117/linux/mpi/intel64/lib/debug/libmpi.so.12(MPIX_Comm_agree+0x518) [0x7ffff61ea221]
[0] /home/bmaggard/oca/a.out() [0x40399b]
[0] /home/bmaggard/oca/a.out() [0x404083]
[0] /home/bmaggard/oca/a.out() [0x402716]
[0] /home/bmaggard/oca/a.out() [0x416fc5]
[0] /lib64/libc.so.6(__libc_start_main+0x7a) [0x7ffff4f150aa]
[0] /home/bmaggard/oca
[0] Abort(1) on node 0: Internal error
[0] [Thread 0x7ffff3abe700 (LWP 17695) exited]
[0] [Thread 0x7ffff42bf700 (LWP 17694) exited]
[0] [Inferior 1 (process 17680) exited with code 01]
[0] (mpigdb)
mpigdb: ending..
mpigdb: kill 17651
The same assertion failure was observed under Win64, where the output gives a bit more information about where to look:
[0] MPI startup(): libfabric version: 1.6.1a1-impi
[0] MPI startup(): libfabric provider: sockets
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 8364 pe-mgr-laptop {0,1,2,3,4,5,6,7}
Assertion failed in file c:\iusers\jenkins\workspace\ch4-build-windows\impi-ch4-build-windows-builder\\src\mpid\ch4\src\ch4_comm.h at line 89: 0
No backtrace info available
Abort(1) on node 0: Internal error
Inspecting the mpich source code (https://github.com/pmodels/mpich, tag v3.3b2), src/mpid/ch4/src/ch4_comm.h contains the following (lines 88-97):
MPL_STATIC_INLINE_PREFIX int MPID_Comm_revoke(MPIR_Comm * comm_ptr, int is_remote)
{
    MPIR_FUNC_VERBOSE_STATE_DECL(MPID_STATE_MPID_COMM_REVOKE);
    MPIR_FUNC_VERBOSE_ENTER(MPID_STATE_MPID_COMM_REVOKE);

    MPIR_Assert(0);

    MPIR_FUNC_VERBOSE_EXIT(MPID_STATE_MPID_COMM_REVOKE);
    return 0;
}
If I comment out the part of the OpenCoarrays-2.2.0 build system (in src/mpi/CMakeLists.txt) that adds the -DUSE_FAILED_IMAGES definition when building libcaf_mpi.so, then 44 of the first 51 OpenCoarrays-2.2.0 test cases pass with Intel MPI 2019 (none of those tests use failed images), which is a proof of concept that Intel MPI can work. All 78 tests (including those that use failed images) pass with mpich-3.3b3, but mpich is the MPI implementation OpenCoarrays is developed against.
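For reference, commenting out the define can be sketched as below. The file contents here are a hypothetical stand-in (I am only illustrating the edit, not quoting the real src/mpi/CMakeLists.txt), and the sed invocation assumes GNU sed on Linux; the same one-liner applied to the real file comments out whichever line adds USE_FAILED_IMAGES.

```shell
# Sketch in a scratch file (hypothetical line content); in practice run the
# sed command against OpenCoarrays-2.2.0/src/mpi/CMakeLists.txt instead.
f=$(mktemp)
echo 'add_definitions(-DUSE_FAILED_IMAGES)' > "$f"    # stand-in for the real line
sed -i 's/^\(.*USE_FAILED_IMAGES.*\)$/# \1/' "$f"     # prefix the line with '# '
cat "$f"    # -> # add_definitions(-DUSE_FAILED_IMAGES)
```

After this edit, a clean rebuild (re-running install.sh or cmake + make) produces a libcaf_mpi.so that makes no ULFM calls.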
I would like to learn more about this assertion failure, and how the '#ifdef USE_FAILED_IMAGES' in OpenCoarrays-2.2.0/src/mpi/mpi_caf.c interact to cause this assertion to fail. I also wanted to bring this to the Intel MPI developer(s) attention as they work toward release of 2019, Update 1.