
Intel MPI at 4000 ranks


Hi, we're testing Intel MPI on CentOS 7.5 with InfiniBand interconnects.

Using the Intel MPI Benchmarks (IMB), small-scale tests (10 nodes, 400 MPI ranks) look OK, while the 100-node (4000-rank) job crashes. Running with FI_LOG_LEVEL=debug yielded the following messages:

libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:verbs:fabric:fi_ibv_set_default_attr():1085<info> Ignoring provider default value for tx rma_iov_limit as it is greater than the value supported by domain: mlx5_0
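
For reference, the failing case is launched roughly as below; the hostfile name, ranks-per-node count, and benchmark selection are placeholders, and pinning FI_PROVIDER=verbs is an assumption based on the verbs/ofi_rxm lines in the log:

export FI_PROVIDER=verbs
export FI_LOG_LEVEL=debug
# 100 nodes x 40 ranks per node = 4000 ranks
mpirun -n 4000 -ppn 40 -f ./hostfile ./IMB-MPI1 Alltoall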

Is there any way to trace the cause of these issues? Any comments are appreciated.
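
For what it's worth, here is roughly what I plan to check next. These are standard libfabric, InfiniBand, and Intel MPI diagnostics; the device name mlx5_0 is taken from the log above, and the exact options may need adjusting:

fi_info -p verbs                 # what the verbs provider reports (endpoint types, limits)
ibv_devinfo -d mlx5_0 -v         # HCA attributes such as max_qp and max_cqe
ulimit -l                        # locked-memory limit; verbs at 4000 ranks needs a large or unlimited value
I_MPI_DEBUG=5 mpirun -n 4000 -ppn 40 -f ./hostfile ./IMB-MPI1   # Intel MPI's own debug output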

Thanks,

BJ

