Channel: Clusters and HPC Technology

IntelMPI DAPL Question


Dear MPI team,

After I restarted a slow-running MPI job, I started receiving these messages from a node.

As far as I can tell, they originate from Intel MPI. Do you have any suggestions as to what may be triggering them?

gl0396:SCM:4a7f:aaae7d40: 18 us(18 us):  open_hca: device mlx4_0 not found
gl0396:SCM:4a7f:aaae7d40: 16 us(16 us):  open_hca: device mlx4_0 not found
gl0397:UCM:493a:aaae7d40: 48102 us(48102 us):  create_ah: ERR Invalid argument
[359:gl0397][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
gl0397:UCM:493a:aaae7d40: 48130 us(28 us): UCM connect: snd ERR -> cm_lid 0 cm_qpn ac1009c0 r_psp 4a7f p_sz=24
[356:gl0394][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
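To see at a glance which node reported which error, the log above can be grouped per host with a small script. This is a minimal sketch, not an Intel MPI or DAPL tool: the two regular expressions are assumptions inferred only from the lines quoted here (DAPL provider debug lines of the form `host:CM:pid:tid: N us(M us): message`, and Intel MPI netmod errors of the form `[rank:host][file:line] error(code): message`), and the names `group_by_host`, `DAPL_RE`, `MPI_RE` and `SAMPLE` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed shape of DAPL provider debug lines, e.g.
#   gl0396:SCM:4a7f:aaae7d40: 18 us(18 us):  open_hca: device mlx4_0 not found
DAPL_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<cm>[A-Z]+):[0-9a-f]+:[0-9a-f]+: "
    r"\d+ us\(\d+ us\):\s+(?P<msg>.*)$"
)

# Assumed shape of Intel MPI netmod error lines, e.g.
#   [359:gl0397][.../dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: ...
MPI_RE = re.compile(
    r"^\[(?P<rank>\d+):(?P<host>[^\]]+)\]\[[^\]]+\] "
    r"error\((?P<code>0x[0-9a-f]+)\): (?P<msg>.*)$"
)


def group_by_host(lines):
    """Return {host: [message, ...]} for every line matching either shape.

    Lines that match neither pattern are silently skipped.
    """
    by_host = defaultdict(list)
    for line in lines:
        line = line.strip()
        match = DAPL_RE.match(line) or MPI_RE.match(line)
        if match:
            by_host[match.group("host")].append(match.group("msg"))
    return dict(by_host)


# The six log lines from the post, used as sample input.
SAMPLE = [
    "gl0396:SCM:4a7f:aaae7d40: 18 us(18 us):  open_hca: device mlx4_0 not found",
    "gl0396:SCM:4a7f:aaae7d40: 16 us(16 us):  open_hca: device mlx4_0 not found",
    "gl0397:UCM:493a:aaae7d40: 48102 us(48102 us):  create_ah: ERR Invalid argument",
    "[359:gl0397][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] "
    "error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()",
    "gl0397:UCM:493a:aaae7d40: 48130 us(28 us): UCM connect: snd ERR -> cm_lid 0 cm_qpn ac1009c0 r_psp 4a7f p_sz=24",
    "[356:gl0394][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] "
    "error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()",
]

if __name__ == "__main__":
    for host, msgs in group_by_host(SAMPLE).items():
        print(f"{host}: {len(msgs)} message(s)")
    # gl0396: 2 message(s)
    # gl0397: 3 message(s)
    # gl0394: 1 message(s)
```

Grouping this way makes it easier to see that only gl0396 reports `open_hca` failures, while gl0397 and gl0394 report connection failures on the `ofa-v2-mlx5_0-1u` provider.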

Thank you!

Michael
