I'm running on an IBM cluster whose nodes each have dual-socket Ivy Bridge processors and two NVIDIA Tesla K40 cards. I'm trying to run 4 MPI ranks with Intel MPI 5 Update 2, one MPI rank per socket. To learn how to do this, I'm using a simple MPI Hello World program that prints the host name, rank and CPU ID. When I run with 2 MPI ranks, the program works as expected. When I run with 4 MPI ranks using the mpirun that comes with Intel MPI, all 4 ranks run on the node I launched from. I am doing this interactively and get a set of two nodes using the following command:
qsub -I -l nodes=2,ppn=16 -q k20
I am using the following commands to run my program:
source /opt/intel/bin/compilervars.sh intel64; \
source /opt/intel/impi_latest/intel64/bin/mpivars.sh; \
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u; \
/opt/intel/impi_latest/intel64/bin/mpirun -genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=socket -n 4 hw_ibm_impi
If I use a different qsub command, i.e. qsub -I -l nodes=2,ppn=2 -q k20, the program runs as expected with 2 ranks on each node. But that does not seem like the right way to request my node allocation if I also want to run threads from each MPI rank. Also, with my initial qsub command, I can run 32 ranks, which gives 16 ranks per host, and the application runs as expected.
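To check what the scheduler actually hands out in each case, something like the following sketch could be used. It assumes a Torque/PBS-style scheduler that exports PBS_NODEFILE with one line per allocated slot, which I have not confirmed for this cluster. The function name summarize_nodefile is made up for illustration and is not part of Intel MPI or the scheduler.

```shell
#!/bin/sh
# Summarize a Torque/PBS-style node file: one output line per host with its slot count.
# Assumption: the node file lists each host once per allocated slot.
summarize_nodefile() {
    # $1: path to the node file (hypothetical helper for illustration)
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

# Only print a summary when running inside a job that exports a readable PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If both hosts show 16 slots with the first qsub command and 2 slots with the second, that would be consistent with the launcher filling all slots on the first host before moving to the second. That is only a guess at this point, not a confirmed diagnosis.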
I also tried the Intel mpiexec command instead of mpirun and got the following result:
source /opt/intel/bin/compilervars.sh intel64; \
source /opt/intel/impi_latest/intel64/bin/mpivars.sh; \
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u; \
/opt/intel/impi_latest/intel64/bin/mpiexec -genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=socket -n 4 hw_ibm_impi
mpiexec_ibm-011: cannot connect to local mpd (/tmp/mpd2.console_username); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
Any ideas why this is not working? Am I not using I_MPI_PIN_DOMAIN correctly? Could there be something messed up with the Intel MPI installation on the cluster? Or some problem with the installation of the scheduler?