Hi,
When we run an MPI job, we get the following error and the job fails.
Here is the MPI command:
mpirun -genv I_MPI_FABRICS shm:dapl -f mpi_hosts -perhost 48 -n 288 /path/binary
===================================================================================
...
alps8-21.cluster.nchc.org.tw:CMA:c7b:3327f700: 27333960 us(4008027 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c7f:652af700: 27843555 us(4008028 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c81:f86aa700: 27490415 us(4008032 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c83:6ed48700: 27477294 us(4008028 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c85:98223700: 27433706 us(4008031 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c69:ecb71700: 27398107 us(4008228 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-25.cluster.nchc.org.tw:CMA:80c2:435aa700: 27601304 us(4004013 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.7.46 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c6b:245aa700: 28623785 us(5187228 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c57:a5b87700: 29139082 us(4004020 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-22.cluster.nchc.org.tw:CMA:a80c:82624700: 30619141 us(4005003 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.21 retry (9)..
alps8-22.cluster.nchc.org.tw:CMA:a80e:8ab41700: 30623757 us(4005005 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.21 retry (9)..
alps8-22.cluster.nchc.org.tw:CMA:a812:54b45700: 30645431 us(4005010 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.21 retry (9)..
===================================================================================
Could you please give us some pointers on what might be causing this?
Bruce