Hello,
We are having trouble using Intel MPI with PBSPro on our cluster in a Kerberized environment.
The thing is, PBSPro doesn't forward Kerberos tickets, which prevents us from having password-less ssh. Our security officers reject ssh keys without a passphrase; besides, we are expected to rely on Kerberos to connect through ssh.
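The symptom is easy to reproduce: inside a PBS job the credential cache is empty, so any ssh to a sister node falls back to asking for a password. A quick check, assuming standard MIT Kerberos client tools:

qsub -I -l select=2
klist                         # -> "klist: No credentials cache found"
ssh node029.sis.cnes.fr true  # prompts for a password instead of using a ticket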
As you might expect, a simple
mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh" # that simply calls bash builtin echo
fails because pmi_proxy hangs; in the end the walltime is exceeded, and we observe:
[...]
[mpiexec@node028.sis.cnes.fr] Launch arguments: /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 0
[mpiexec@node028.sis.cnes.fr] Launch arguments: /bin/ssh -x -q node029.sis.cnes.fr /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 1
[proxy:0:0@node028.sis.cnes.fr] Start PMI_proxy 0
[proxy:0:0@node028.sis.cnes.fr] STDIN will be redirected to 1 fd(s): 17
[0] node: 0 / /
=>> PBS: job killed: walltime 23 exceeded limit 15
[mpiexec@node028.sis.cnes.fr] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@node028.sis.cnes.fr] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@node028.sis.cnes.fr] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@node028.sis.cnes.fr] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node028.sis.cnes.fr] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node028.sis.cnes.fr] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
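For completeness, echo-node.sh is nothing fancy; it boils down to something like this (a sketch, the exact contents don't matter):

#!/bin/bash
# Trivial per-rank payload: just echo which node we landed on.
echo "node: $(hostname)"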
If instead we log onto the master node, run kinit, and then run mpirun by hand, everything works fine. Except this isn't exactly an acceptable workaround.
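In other words, this manual sequence succeeds, because kinit puts a ticket in the cache that ssh can then use to reach the other nodes:

ssh node028.sis.cnes.fr   # log onto the first node of the job
kinit                     # type the password once; a ticket lands in the cache
mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh"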
I've tried playing with the fabrics, as the nodes are also connected over InfiniBand, but I had no luck there. If I'm not mistaken, pmi_proxy requires password-less ssh whatever fabric we use. Am I right?
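For the record, what I tried was along these lines, using the usual Intel MPI 2017 fabric selection variable:

export I_MPI_FABRICS=shm:dapl   # also tried shm:ofa and shm:tcp
mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh"

As far as I understand, the fabric only affects MPI communication once the ranks are up; Hydra still launches pmi_proxy over ssh either way.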
BTW, I've also tried to play with Altair PBSPro's pbsdsh. The parameters it expects are not compatible with the ones fed by mpirun. Besides, even when I wrap pbsdsh to translate the arguments (see the sketch after the error below), pmi_proxy still fails with a
[proxy:0:0@node028.sis.cnes.fr] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "node028.sis.cnes.fr" to "node028.sis.cnes.fr" (Connection refused)
[proxy:0:0@node028.sis.cnes.fr] main (../../pm/pmiserv/pmip.c:461): unable to connect to server node028.sis.cnes.fr at port 49813 (check for firewalls!)
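By "wrap" I mean pointing Hydra's bootstrap at a small shim that drops the ssh-style options and rewrites the call into what pbsdsh accepts. A rough sketch; ssh2pbsdsh.sh is my own hypothetical name, and the index arithmetic assumes one entry per node in PBS_NODEFILE:

export I_MPI_HYDRA_BOOTSTRAP_EXEC=/path/to/ssh2pbsdsh.sh

#!/bin/bash
# ssh2pbsdsh.sh: Hydra invokes us as "ssh -x -q <host> <cmd...>".
# Skip the ssh options, keep the target host and the command.
while [ "${1#-}" != "$1" ]; do shift; done
host="$1"; shift
# pbsdsh addresses vnodes by index, not by name: look the host up.
index=$(grep -n -m1 "^${host}" "$PBS_NODEFILE" | cut -d: -f1)
exec pbsdsh -n $((index - 1)) -- "$@"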
So, my question: is there a workaround? Something I've missed? Every clue I can gather from googling and experimenting points me towards "password-less ssh". So far the only workaround we've found consists of using another MPI framework :(
Regards,