Hello,
I have provided an ifort + Intel MPI build of a CFD solver to a customer; it ships with the Intel MPI runtime environment. The customer's cluster uses an SGE scheduler. The user is attempting to run on a single node with mpirun using the default Hydra process manager, but the job hangs without any meaningful error message. I asked the user to rerun with I_MPI_HYDRA_DEBUG=1 and I_MPI_DEBUG=6, but I don't see anything unusual or unexpected in the output. The last lines of the output are the following:
[mpiexec@compute-31] Launch arguments: /cm/shared/apps/sge/current/bin/linux-x64/qrsh -inherit -V compute-31.cm.cluster /a/fine/10.2/fine102/LINUX/_mpi/_impi5.0.3/intel64/bin/pmi_proxy --control-port compute-31.cm.cluster:46905 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk user --launcher sge --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1687715305 --usize -2 --proxy-id 0
[mpiexec@compute-31] STDIN will be redirected to 1 fd(s): 9
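For reference, the debug run is launched from inside the SGE job script roughly as follows; the rank count and solver binary name are placeholders, and only the two I_MPI_* variables are the ones actually set to obtain the output above:

# Enable Hydra launcher tracing and Intel MPI library debug output
export I_MPI_HYDRA_DEBUG=1
export I_MPI_DEBUG=6
# Placeholder binary name and rank count
mpirun -np 16 ./cfd_solver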
I ran a similar test on another cluster, where the job runs correctly. On that cluster, the lines with the [proxy:0:0@nmc-0066] prefix appear directly after the last line reported on the customer's cluster:
[mpiexec@nmc-0066] STDIN will be redirected to 1 fd(s): 9
[proxy:0:0@nmc-0066] Start PMI_proxy 0
[proxy:0:0@nmc-0066] STDIN will be redirected to 1 fd(s): 9
From these tests I suspect the hang occurs while launching pmi_proxy: on the customer's cluster the proxy never reports that it has started. Are there any additional verbose or debugging modes that could help identify the problem on the user's cluster?
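As a possible way to narrow this down, the qrsh step shown in the launch arguments could be reproduced by hand from within an SGE session on the customer's node, to see whether qrsh -inherit itself hangs independently of pmi_proxy. This is only a sketch using the qrsh path and hostname taken from the log above, with hostname standing in for the pmi_proxy command:

# Run from inside an SGE job on compute-31; hostname stands in for pmi_proxy
/cm/shared/apps/sge/current/bin/linux-x64/qrsh -inherit -V compute-31.cm.cluster hostname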
Thank you for your help,
-David