Channel: Clusters and HPC Technology

MPI job submitted with TORQUE does not use InfiniBand if running and start nodes overlap


Hi there,

I am using Intel MPI (4.1.0.024 from ICS 2013.0.028) to run my parallel application (Gromacs 4.6.1 molecular dynamics) on an SGI cluster with CentOS 6.2 and Torque 2.5.12.

When I submit an MPI job with Torque to start and run on 2 nodes, MPI startup fails to negotiate InfiniBand (IB) and internode communication falls back to Ethernet. This is my job script:

#PBS -l nodes=n001:ppn=32+n002:ppn=32
#PBS -q normal
source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
source /opt/progs/gromacs/bin/GMXRC.bash
cd $PBS_O_WORKDIR/
mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out

and this is the output:
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
....
[45] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
....
[33] MPI startup(): DAPL provider <NULLstring> on rank 0:n001 differs from ofa-v2-mlx4_0-1(v2.0) on rank 33:n002
...
[0] MPI startup(): shm and tcp data transfer modes
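The line that matters in that log is the final one from rank 0, which names the fabrics actually selected. As a rough sketch for checking several runs quickly, the helpers below pull that line out of a log such as md.out. The function names fabric_modes and uses_dapl are hypothetical (they are not part of Intel MPI), and the pattern assumes the log format shown above.

```shell
#!/bin/sh
# Print the fabric pair reported by rank 0 in an Intel MPI debug log.
# fabric_modes is a hypothetical helper name, not an Intel MPI tool.
# It matches lines such as:
#   [0] MPI startup(): shm and tcp data transfer modes
# and prints only the part before "data transfer modes".
fabric_modes() {
    sed -n 's/^\[0\] MPI startup(): \(.*\) data transfer modes$/\1/p' "$@"
}

# Return 0 if the log shows DAPL (the InfiniBand path here) in use,
# and 1 if it does not (for example after a fallback to tcp).
uses_dapl() {
    fabric_modes "$@" | grep -q 'dapl'
}
```

For example, `fabric_modes md.out` would print the selected fabric pair for the job above, and `uses_dapl md.out || echo "fell back"` flags a run that did not get IB.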

However, MPI negotiates IB fine if I run the same mpiexec.hydra line from the console, either logged in to n001 (one of the running nodes) or logged in to another node, say the admin node. It also works fine if I submit the TORQUE job with a start node that is not one of the running nodes (-machinefile macs still points to n001 and n002), for example with #PBS -l nodes=n003 and the rest identical to the script above. This is a successful (IB) output:

[55] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
...
[29] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
...
[0] MPI startup(): shm and dapl data transfer modes
...
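Since the failure seems tied to whether the start node is also one of the running nodes, a small check like the following could be added to the job script to record which case a given run falls into. It is only a sketch: start_node_in_machinefile is a hypothetical helper name, it assumes the first line of $PBS_NODEFILE is the node where Torque starts the script, and it assumes the machinefile lists one host per line, optionally as host:count, using the same (short) host names as Torque.

```shell
#!/bin/sh
# Return 0 if the job's start node also appears in the MPI machinefile.
# start_node_in_machinefile is a hypothetical helper, not a Torque or
# Intel MPI command.
#   $1: Torque node file (assumption: its first line is the start node)
#   $2: machinefile passed to mpiexec.hydra (host or host:count per line)
start_node_in_machinefile() {
    start=$(head -n 1 "$1")
    # Drop any ":count" suffix, then look for an exact host name match.
    cut -d: -f1 "$2" | grep -qxF "$start"
}
```

Inside the job script this could be called as `start_node_in_machinefile "$PBS_NODEFILE" macs && echo "start node overlaps running nodes"`, so that the overlap is logged next to the MPI startup messages.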

Any tips on what is going wrong? Please let me know if you need more info. This has also been posted to the TORQUE user list, but your help is welcome, too.

Cheers,

Guilherme

