I run a 16-node, 256-core Dell cluster on Red Hat Enterprise Linux.
Our primary use is to run the engineering software LSTC LS-Dyna. With a recent change in LSTC licensing, the newest versions of the software we want to use will now only run using Intel MPI (previously we used Platform MPI).
However, I now cannot get the PBS job submission script that used to work with Platform MPI to work with Intel MPI.
The submission script reads as follows (the last line is the submission line for the LS-Dyna test job testjob.k):
#!/bin/bash
#PBS -l select=8:ncpus=16:mpiprocs=16
#PBS -j oe
cd $PBS_JOBDIR
echo "starting dyna .. "
machines=$(sort -u $PBS_NODEFILE)
ml=""
for m in $machines
do
    nproc=$(grep $m $PBS_NODEFILE | wc -l)
    sm=$(echo $m | cut -d'.' -f1)
    if [ "$ml" == "" ]
    then
        ml=$sm:$nproc
    else
        ml=$ml:$sm:$nproc
    fi
done
echo Machine line: $ml
echo PBS_O_WORKDIR=$PBS_O_WORKDIR
echo "Current directory is:"
pwd
echo "machines"
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -machines $ml /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp
When I attempt to run this job via the PBS job manager and look in the standard error file, I see:
[mpiexec@gpunode03.hpc.internal] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@gpunode03.hpc.internal] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
[mpiexec@gpunode03.hpc.internal] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:176): unable to send signal downstream
[mpiexec@gpunode03.hpc.internal] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@gpunode03.hpc.internal] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@gpunode03.hpc.internal] main (../../ui/mpich/mpiexec.c:1157): process manager error waiting for completion
I know I can submit a job manually (with no PBS involved) and it will run fine on a node of the cluster using Intel MPI.
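For reference, the manual invocation that works is along these lines (run directly on one node, outside PBS; the -np value here is just illustrative):

/opt/intel/impi/2018.4.274/intel64/bin/mpirun -np 16 /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp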
So I have narrowed the issue down to the -machines $ml part of the submission line, i.e. the part that handles the node allocation.
For some reason Intel MPI does not accept this syntax, whereas Platform MPI did?
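In case it clarifies what I am asking: from the Intel MPI reference I get the impression that Hydra's mpirun wants a machine file (one host:nranks entry per line) passed via -machinefile, rather than an inline colon-separated list, so my best guess at the equivalent would be something like the following (untested on my side, so treat the exact flags as my assumption):

# Untested sketch: write one "host:nranks" line per node, derived from PBS_NODEFILE
sort $PBS_NODEFILE | uniq -c | awk '{print $2":"$1}' > machines.txt
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -machinefile machines.txt /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp

I have also read that Hydra is supposed to detect PBS and pick up $PBS_NODEFILE by itself, in which case a plain mpirun -np 128 might be enough, but I have not been able to confirm that.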
I am quite stumped here and any advice would be greatly appreciated.
Thanks.
Richard.