Channel: Clusters and HPC Technology

AWS, Intel MPI and EFA


I'm having some issues with the latest Intel MPI and EFA on AWS instances.

I installed Intel MPI using the install script found elsewhere in the support forums.

(https://software.intel.com/sites/default/files/managed/f4/92/install_imp...)

I grabbed the latest libfabric source and built that.

The instance already had AWS's libfabric from the EFA setup install, but it's not in PATH or LD_LIBRARY_PATH for these tests.

[sheistan@compute-041249 ~]$ which mpiexec
/opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec
[sheistan@compute-041249 ~]$ which fi_info
/nasa/libfabric/latest/bin/fi_info
[sheistan@compute-041249 ~]$ fi_info -p efa
provider: efa
    fabric: EFA-fe80::4c2:2aff:fec7:ce80
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::4c2:2aff:fec7:ce80
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::4c2:2aff:fec7:ce80
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
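
Since two libfabric installs are present on the instance (the one built from source and the one from the EFA setup), it may help to confirm which directory the dynamic loader would find first. The following is a minimal sketch, not part of the original setup: `first_libfabric_dir` is a hypothetical helper name, and it only inspects a colon-separated search path; it does not account for RPATH entries or for a libfabric copy that Intel MPI's own environment scripts might add.

```shell
# Hypothetical helper: print the first directory in a colon-separated
# search path that contains a libfabric shared library.
first_libfabric_dir() {
    # $1: a colon-separated path list, such as "$LD_LIBRARY_PATH"
    local IFS=:
    local dir
    for dir in $1; do
        # ls fails if the glob matches nothing in this directory
        if ls "$dir"/libfabric.so* >/dev/null 2>&1; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    # No directory in the list contains libfabric
    return 1
}
```

It could be called as `first_libfabric_dir "$LD_LIBRARY_PATH"`. Running `ldd ./pi_efa | grep libfabric` is a complementary check of what the binary actually resolves.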
 

When running on a single node, everything works as expected:

[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2  ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1
[0] MPI startup(): libfabric provider: efa
compute-041249
compute-041249
  pi is approximately:  3.1415926769620652  Relative Error is:  -0.20387909E-05
 Integration Wall Time = 0.005503 Seconds on       2 Processors for n =  10000000
 

But when two nodes are involved, it hangs, in this case in an mpi_barrier() call:

[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2 -ppn 1 ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1
[0] MPI startup(): libfabric provider: efa
compute-041116
compute-041249
^C[mpiexec@compute-041249] Sending Ctrl-C to processes as requested
[mpiexec@compute-041249] Press Ctrl-C again to force abort
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
pi_efa             0000000000404724  Unknown               Unknown  Unknown
libc-2.26.so       00007F80892447E0  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D5A007  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D5B170  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D0B38D  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D0A72E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A7CAC90  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A0EBF7B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A7F5B2F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A3F42C0  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A052F7C  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A054138  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A17F4E2  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A05446E  MPI_Barrier           Unknown  Unknown
libmpifort.so.12.  00007F808AF1573C  pmpi_barrier          Unknown  Unknown
pi_efa             0000000000402EF9  Unknown               Unknown  Unknown
pi_efa             0000000000402E52  Unknown               Unknown  Unknown
libc-2.26.so       00007F808923102A  __libc_start_main     Unknown  Unknown
pi_efa             0000000000402D6A  Unknown               Unknown  Unknown
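
To gather more detail on a hang like this without pressing Ctrl-C by hand, one option is to rerun with more verbose logging and a time limit. The sketch below is an assumption about how that could be scripted, not something from the runs above: `debug_run` is a hypothetical helper, `I_MPI_DEBUG=5` and `FI_LOG_LEVEL=warn` are example verbosity settings, and it relies on the coreutils `timeout` command being installed.

```shell
# Hypothetical helper: run a command with more verbose Intel MPI and
# libfabric logging, and stop it if it exceeds a time limit.
debug_run() {
    # $1: time limit in seconds; remaining arguments: the command to run
    limit=$1
    shift
    status=0
    # The variable assignments apply only to this one command
    I_MPI_DEBUG=5 FI_LOG_LEVEL=warn timeout "$limit" "$@" || status=$?
    # coreutils timeout exits with 124 when the time limit was reached
    if [ "$status" -eq 124 ]; then
        echo "command timed out after ${limit}s (possible hang)" >&2
    fi
    return "$status"
}
```

For the two-node case it could be invoked as `debug_run 60 mpiexec --hostfile "$PBS_NODEFILE" -np 2 -ppn 1 ./pi_efa`. Note that `timeout` signals only the `mpiexec` process, so how the remote ranks are cleaned up depends on the launcher.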
 

Thoughts on something to try or something I missed?

 

thanks

 

s

 

 

 

