Dear MPI team,
I started receiving the following messages from a node after restarting an MPI job that had been progressing slowly.
They appear to originate from Intel MPI. Do you have any suggestions as to what may be triggering them?
gl0396:SCM:4a7f:aaae7d40: 18 us(18 us): open_hca: device mlx4_0 not found
gl0396:SCM:4a7f:aaae7d40: 16 us(16 us): open_hca: device mlx4_0 not found
gl0397:UCM:493a:aaae7d40: 48102 us(48102 us): create_ah: ERR Invalid argument
[359:gl0397][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
gl0397:UCM:493a:aaae7d40: 48130 us(28 us): UCM connect: snd ERR -> cm_lid 0 cm_qpn ac1009c0 r_psp 4a7f p_sz=24
[356:gl0394][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:247] error(0x30000): ofa-v2-mlx5_0-1u: could not connect DAPL endpoints: DAT_INSUFFICIENT_RESOURCES()
Thank you!
Michael