Hello,
I am compiling and running a large electronic structure program on an NSF supercomputer, using the intel/15.0.2 Fortran compiler and impi/5.0.2, the latest Intel MPI library installed on the system.
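For context, the build step looks roughly like the following; this is only a sketch, where the module names are the ones above but the source file name, output name, and optimization level are placeholders:
module load intel/15.0.2 impi/5.0.2
mpiifort -O2 -qopenmp -o program.x main.f90    # Intel MPI Fortran wrapper; -qopenmp enables the OpenMP parts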
The program uses hybrid parallelization (MPI and OpenMP). When I run it on a molecule using 4 MPI tasks on a single node (no OpenMP threading anywhere here), I obtain the correct result.
However, when I spread the same 4 tasks across 2 nodes (still 4 tasks total, just 2 per node), I get what appear to be numerical-/precision-related errors.
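For reference, the two launch configurations are roughly as follows (the executable name is a placeholder; -n is the total number of MPI tasks and -ppn the number of tasks per node):
# 4 tasks on a single node: correct results
mpiexec.hydra -n 4 -ppn 4 ./program.x
# 4 tasks spread over 2 nodes (2 per node): apparent numerical/precision errors
mpiexec.hydra -n 4 -ppn 2 ./program.x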
Following Michael Steyer's (Intel) slides on Intel MPI conditional reproducibility (http://goparallel.sourceforge.net/wp-content/uploads/2015/06/PUM21-3-Int...), I selected topology-unaware algorithms for all collective operations by running mpiexec.hydra with the following flags (the assembled launch line is sketched after the list):
-genv I_MPI_DEBUG 100
-genv I_MPI_ADJUST_ALLGATHER 1
-genv I_MPI_ADJUST_ALLGATHERV 1
-genv I_MPI_ADJUST_ALLREDUCE 2
-genv I_MPI_ADJUST_ALLTOALL 1
-genv I_MPI_ADJUST_ALLTOALLV 1
-genv I_MPI_ADJUST_ALLTOALLW 1
-genv I_MPI_ADJUST_BARRIER 1
-genv I_MPI_ADJUST_BCAST 1
-genv I_MPI_ADJUST_EXSCAN 1
-genv I_MPI_ADJUST_GATHER 1
-genv I_MPI_ADJUST_GATHERV 1
-genv I_MPI_ADJUST_REDUCE 1
-genv I_MPI_ADJUST_REDUCE_SCATTER 1
-genv I_MPI_ADJUST_SCAN 1
-genv I_MPI_ADJUST_SCATTER 1
-genv I_MPI_ADJUST_SCATTERV 1
-genv I_MPI_ADJUST_REDUCE_SEGMENT 1:14000
-genv I_MPI_STATS_SCOPE "topo"
-genv I_MPI_STATS "ipm"
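Putting these together, the assembled launch line looks roughly like this (the executable name and placement flags are placeholders; the -genv settings are exactly the ones listed above):
mpiexec.hydra \
    -genv I_MPI_DEBUG 100 \
    -genv I_MPI_ADJUST_ALLREDUCE 2 \
    ... (remaining -genv settings from the list above) ... \
    -n 4 -ppn 2 ./program.x
Equivalently, the same I_MPI_* variables can be exported in the job script before the mpiexec.hydra call.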
This lets my job proceed further than before; however, it still dies with what appear to be numerical-/precision-related errors.
My question is: what other topology-aware settings exist that I could try to disable, so that multi-node runs reproduce the correct results I obtain when the MPI tasks run on only a single node? I have pored through the Intel MPI reference manual and haven't found anything beyond the variables above.
Please note that multi-node runs sometimes do work, e.g., 2 MPI tasks total spread over two nodes; it really looks to me like a strange topology issue. Another note: building and running with the latest installed versions of Open MPI and MVAPICH2 consistently dies with segmentation faults, so those libraries aren't really an option here. I see the same behavior regardless of which nodes are allocated to me, and I have tested this many times.
Thank you very much in advance for your help!
Best,
Andrew