Hi All,
I have two Dell R815 servers, each with 4 AMD Opteron 6380 processors (16 cores each), connected directly to each other through two InfiniBand cards. I have trouble running the IMB-MPI1 test even on a single node:
mpirun -n 2 -genv I_MPI_DEBUG=3 -genv I_MPI_FABRICS=ofi /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1
The run aborted with the following error:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 2.23 0.00
1 1000 2.24 0.45
2 1000 2.25 0.89
4 1000 2.26 1.77
8 1000 2.24 3.57
16 1000 2.25 7.12
32 1000 2.27 14.08
64 1000 2.43 26.33
128 1000 2.55 50.26
256 1000 3.60 71.08
512 1000 4.12 124.40
1024 1000 5.04 203.00
2048 1000 6.89 297.38
4096 1000 10.56 387.76
8192 1000 13.98 585.83
16384 1000 22.74 720.65
32768 1000 30.12 1087.81
65536 640 46.17 1419.45
131072 320 76.43 1714.87
262144 160 334.23 784.32
524288 80 511.22 1025.57
1048576 40 850.76 1232.51
2097152 20 1518.37 1381.19
Abort(941742351) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(155)............: MPI_Send(buf=0x3a100f0, count=4194304, MPI_BYTE, dest=1, tag=1, comm=0x84000003) failed
MPID_Send(572)............:
MPIDI_send_unsafe(203)....:
MPIDI_OFI_send_normal(414):
(unknown)(): Other MPI error
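If more detail would help, I can rerun with verbose logging from both Intel MPI and the bundled libfabric, for example (this is just a sketch of what I could collect; it assumes the libfabric shipped with Intel MPI 2019.5 is the one picked up from my environment, and fi_info is the copy on my PATH from the Intel MPI setup scripts):
mpirun -n 2 -genv I_MPI_DEBUG=6 -genv FI_LOG_LEVEL=debug -genv I_MPI_FABRICS=ofi /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1
fi_info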
However, it runs fine with shm:
mpirun -n 2 -genv I_MPI_DEBUG=3 -genv I_MPI_FABRICS=shm /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1
Running with 2 ranks on two different nodes also fails at the 4 MB message size.
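For reference, the two-node run looked like this (node01 and node02 are placeholders for the actual hostnames):
mpirun -n 2 -ppn 1 -hosts node01,node02 -genv I_MPI_DEBUG=3 -genv I_MPI_FABRICS=ofi /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1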
I have been struggling with this for a few days now without success. Any suggestions on where to look or what to try?
Thanks!
Qi