Hi there,
I am using Intel MPI (4.1.0.024 from ICS 2013.0.028) to run my parallel application (Gromacs 4.6.1 molecular dynamics) on an SGI cluster with CentOS 6.2 and Torque 2.5.12.
When I submit an MPI job with Torque to start and run on 2 nodes, MPI startup fails to negotiate with InfiniBand (IB) and internode communication falls back to Ethernet. This is my job script:
#PBS -l nodes=n001:ppn=32+n002:ppn=32
#PBS -q normal
source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
source /opt/progs/gromacs/bin/GMXRC.bash
cd $PBS_O_WORKDIR/
mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out
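For reference, the machinefile macs just lists the two compute nodes with their slot counts; if I recall its exact format correctly, it is simply:
n001:32
n002:32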
This is the output I get:
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
....
[45] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
....
[33] MPI startup(): DAPL provider <NULLstring> on rank 0:n001 differs from ofa-v2-mlx4_0-1(v2.0) on rank 33:n002
...
[0] MPI startup(): shm and tcp data transfer modes
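If it helps narrow this down, I can rerun with the fabric forced explicitly and with more verbose startup output, for example by adding the lines below just before the mpiexec.hydra call (as far as I understand, these are the relevant Intel MPI 4.1 variables; the exact settings are an untested guess on my part):
# Force shm+dapl so a DAPL failure aborts instead of silently falling back to tcp
export I_MPI_FABRICS=shm:dapl
# Pin the DAPL provider that the working runs report
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1
# Verbose startup output to see where fabric selection goes wrong
export I_MPI_DEBUG=5
mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out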
However, MPI negotiates fine with IB if I run the same mpiexec.hydra line from a console, either logged in to n001 (one of the running nodes) or to another node (say, the admin node). It also works fine if I submit the Torque job using a start node different from the running nodes (-machinefile macs still points to n001 and n002), e.g. with #PBS -l nodes=n003 and the rest identical to the above; the full variant script is shown after the output below. This is a successful (IB) output:
[55] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
...
[29] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
...
[0] MPI startup(): shm and dapl data transfer modes
...
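For reference, the working Torque variant mentioned above only changes the resource request; the machinefile and the mpiexec.hydra line are identical to the failing script:
#PBS -l nodes=n003
#PBS -q normal
source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
source /opt/progs/gromacs/bin/GMXRC.bash
cd $PBS_O_WORKDIR/
mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out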
Any tips on what is going wrong? Please let me know if you need more info. I have also posted this to the TORQUE user list, but your help is welcome too.
Cheers,
Guilherme