Hi,
I am trying to set up RDMA between two Azure VMs using Intel's MPI library (v5.1.1.109). Each machine can connect to the other over ssh, and when I run the pingpong benchmark over TCP as follows, I get latency numbers without any errors:
/opt/intel/impi/5.1.1.109/bin64/mpirun -hosts 10.0.0.5,10.0.0.6 -ppn 1 -n 2 -env I_MPI_FABRICS tcp -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 /opt/intel/impi/5.1.1.109/bin64/IMB-MPI1 pingpong
However, when I run the same benchmark over DAPL to get latency numbers for RDMA over IB, I get the following error:
/opt/intel/impi/5.1.1.109/bin64/mpirun -hosts 10.0.0.5,10.0.0.6 -ppn 1 -n 2 -env I_MPI_FABRICS dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 /opt/intel/impi/5.1.1.109/bin64/IMB-MPI1 pingpong
active-copy-1:b9a:438d8700: 4006660 us(4006660 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (15)..
active-copy-1:b9a:438d8700: 8014586 us(4007926 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (14)..
active-copy-1:b9a:438d8700: 12022590 us(4008004 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (13)..
active-copy-1:b9a:438d8700: 16030610 us(4008020 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (12)..
active-copy-1:b9a:438d8700: 20030594 us(3999984 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (11)..
active-copy-1:b9a:438d8700: 24038599 us(4008005 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (10)..
active-copy-1:b9a:438d8700: 28046628 us(4008029 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (9)..
active-copy-1:b9a:438d8700: 32054598 us(4007970 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (8)..
active-copy-1:b9a:438d8700: 36062598 us(4008000 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (7)..
active-copy-1:b9a:438d8700: 40070580 us(4007982 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (6)..
active-copy-1:b9a:438d8700: 44078630 us(4008050 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (5)..
active-copy-1:b9a:438d8700: 48086598 us(4007968 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (4)..
active-copy-1:b9a:438d8700: 52094611 us(4008013 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (3)..
active-copy-1:b9a:438d8700: 56102588 us(4007977 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (2)..
active-copy-1:b9a:438d8700: 60110613 us(4008025 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (1)..
active-copy-1:b9a:438d8700: 60110625 us(12 us): dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 172.16.1.193,3313
[0:10.0.0.5] unexpected DAPL event 0x4008
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(784):
MPID_Init(1326)......: channel initialization failed
MPIDI_CH3_Init(141)..: (unknown)(): Internal MPI error!
I tried disabling the firewall and also launching the benchmark from the other machine, but neither helped. However, if I set both entries in -hosts to the IP address of the local machine, I do get latency numbers. I suspect something is wrong with the interface or with the way these machines try to find each other, but I have no idea what the fix could be. Any idea what is going wrong here?
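
In case it helps with diagnosis: the address DAPL keeps retrying (172.16.1.193) is not one of the two addresses I pass to -hosts (10.0.0.5 and 10.0.0.6). Below is a minimal sketch of how the destination addresses can be pulled out of the DAPL output so they can be compared with the addresses configured on each VM. The function name dapl_dst_addrs and the log file name mpirun_dapl.log are my own placeholders, not part of Intel MPI or DAPL.

```shell
# Print each distinct IPv4 address that appears after "DST" in DAPL
# messages read from standard input (one address per line).
dapl_dst_addrs() {
    grep -oE 'DST [0-9]+(\.[0-9]+){3}' | awk '{print $2}' | sort -u
}

# Hypothetical usage: mpirun_dapl.log is an assumed file that holds
# the saved output of the failing mpirun command.
# dapl_dst_addrs < mpirun_dapl.log
```

The printed addresses can then be checked against the output of `ip -4 addr show` on each VM to see which interface, if any, owns them. This only narrows down where the mismatch is; it does not fix anything by itself.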