Quantcast
Channel: Clusters and HPC Technology
Viewing all articles
Browse latest Browse all 927

mpd daemon prematurely terminating job

$
0
0

Hi everyone,

I am a little out of my depth here so bear with me. I am trying to configure mpirun and mpiexec to run software called Materials Studio on a 1 node, 2 processor, 12 core cluster. The submission scheme is PBS. I had everything set up properly and where I could submit jobs and they would work well but after a few days I ran into issues where I would get this sort of error:

mpiexec_server.org: cannot connect to local mpd (/tmp/mpd2.console_user); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)

It seemed like the daemon for mpd was somehow set up but eventually terminated. I had luck adding this to my submission script:

export PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/bin:$PATH
export LD_LIBRARY_PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/lib:/data1/opt/MD/Linux-x86_64/IntelMPI/bin:/data1/opt/MD/Linux-x86_64/IntelMKL/lib
mpdboot -n 1 -f ~/mpd.hosts
nohup mpd &
/data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel

The job now submits and runs properly but times out after 30 minutes or so. I tried adding '-r ssh' without quotes to the end of the mpdboot line but I am not sure if that is the right strategy to take. Also, I am a little confused about why I need to run this daemon in this script and why I need to call a hosts file when I run- I thought that PBS creates that when the job picks up. Could anyone please give me some advice on where to go next? Basically how can I prevent a job that is running from quitting because of something to do with the mpi daemon. 

Thanks so much for your help!


Viewing all articles
Browse latest Browse all 927

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>