I'm having some issues with the latest Intel MPI and EFA on AWS instances.
I installed Intel MPI using the install script found elsewhere in the support forums.
(https://software.intel.com/sites/default/files/managed/f4/92/install_imp...)
I grabbed the latest libfabric source and built it.
The instance already had AWS's libfabric from the EFA setup install, but it's not in PATH/LD_LIBRARY_PATH for these tests.
[sheistan@compute-041249 ~]$ which mpiexec
/opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec
[sheistan@compute-041249 ~]$ which fi_info
/nasa/libfabric/latest/bin/fi_info
[sheistan@compute-041249 ~]$ fi_info -p efa
provider: efa
fabric: EFA-fe80::4c2:2aff:fec7:ce80
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::4c2:2aff:fec7:ce80
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::4c2:2aff:fec7:ce80
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
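To compare what each node reports, the fi_info listing can be condensed to one line per endpoint. This is only a minimal sketch, assuming awk is available and that fi_info keeps the "key: value" layout shown above; the function name summarize_fi_info is made up for illustration.

```shell
# Hypothetical helper: collapse `fi_info` output into one line per endpoint.
# Assumes every entry starts with "provider:" and ends with "protocol:".
summarize_fi_info() {
    awk '
        $1 == "provider:" { prov = $2 }
        $1 == "domain:"   { dom  = $2 }
        $1 == "type:"     { ept  = $2 }
        $1 == "protocol:" { print prov, dom, ept, $2 }
    '
}

# Usage on each node:
#   fi_info -p efa | summarize_fi_info
```

Running it on both nodes and diffing the results would show whether they expose the same EFA endpoints.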
When running on a single node, everything works as expected:
[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2 ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1
[0] MPI startup(): libfabric provider: efa
compute-041249
compute-041249
pi is approximately: 3.1415926769620652 Relative Error is: -0.20387909E-05
Integration Wall Time = 0.005503 Seconds on 2 Processors for n = 10000000
But when two nodes are involved, it hangs, in this case in an mpi_barrier() call:
[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2 -ppn 1 ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1
[0] MPI startup(): libfabric provider: efa
compute-041116
compute-041249
^C[mpiexec@compute-041249] Sending Ctrl-C to processes as requested
[mpiexec@compute-041249] Press Ctrl-C again to force abort
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
pi_efa 0000000000404724 Unknown Unknown Unknown
libc-2.26.so 00007F80892447E0 Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D5A007 Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D5B170 Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D0B38D Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D0A72E Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A7CAC90 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A0EBF7B Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A7F5B2F Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A3F42C0 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A052F7C Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A054138 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A17F4E2 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A05446E MPI_Barrier Unknown Unknown
libmpifort.so.12. 00007F808AF1573C pmpi_barrier Unknown Unknown
pi_efa 0000000000402EF9 Unknown Unknown Unknown
pi_efa 0000000000402E52 Unknown Unknown Unknown
libc-2.26.so 00007F808923102A __libc_start_main Unknown Unknown
pi_efa 0000000000402D6A Unknown Unknown Unknown
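The traceback has no symbols for the libfabric frames, so a more verbose rerun may say more about where the barrier stalls. The following is a sketch, not a verified fix: I_MPI_DEBUG and FI_LOG_LEVEL are the Intel MPI and libfabric logging variables, the levels chosen here are an assumption, and with_fabric_debug is a made-up wrapper name.

```shell
# Hypothetical wrapper: run any command with extra Intel MPI and libfabric logging.
# The levels (5 and debug) are assumptions about what is verbose enough to help.
with_fabric_debug() {
    env I_MPI_DEBUG=5 FI_LOG_LEVEL=debug "$@"
}

# Usage with the same two-node command as above:
#   with_fabric_debug mpiexec --hostfile $PBS_NODEFILE -np 2 -ppn 1 ./pi_efa
```

Using env keeps the settings scoped to the one command, so the shell's own environment stays unchanged between tests.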
Any thoughts on something to try, or something I missed?
Thanks,
s