Hi, we're testing Intel MPI on CentOS 7.5 with InfiniBand interconnects.
Using the Intel MPI Benchmarks, small-scale tests (10 nodes, 400 MPI ranks) look fine, while a 100-node job (4,000 ranks) crashes. Running with FI_LOG_LEVEL=debug yielded the following messages:
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:verbs:fabric:fi_ibv_set_default_attr():1085<info> Ignoring provider default value for tx rma_iov_limit as it is greater than the value supported by domain: mlx5_0
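For reference, this is roughly how the 100-node job is launched (the hostfile path, per-node rank count, and benchmark selection below are placeholders rather than our exact script):

# Sketch of the launch command; paths and counts are placeholders.
# FI_LOG_LEVEL=debug is the libfabric logging variable mentioned above;
# I_MPI_DEBUG adds Intel MPI runtime diagnostics (level 6 chosen arbitrarily).
export FI_LOG_LEVEL=debug
export I_MPI_DEBUG=6
mpirun -n 4000 -ppn 40 -hostfile ./hosts.100 ./IMB-MPI1 Allreduce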
Is there any way to trace the cause of this issue? Any comments are appreciated.
Thanks,
BJ