Hi,
I've installed Intel Parallel Studio XE Cluster Edition, using the single-node installation configuration, on the master node of a cluster of 8 nodes with 8 processors each. Before installing, I completed the prerequisite steps and verified shell connectivity by running .sshconnectivity and creating the machines.LINUX file. The output showed that all 8 nodes were found:
*******************************************************************************
Node count = 8
Secure shell connectivity was established on all nodes.
See the log output listing "/tmp/sshconnectivity.aditya.log" for details.
Version number: $Revision: 259 $
Version date: $Date: 2012-06-11 23:26:12 +0400 (Mon, 11 Jun 2012) $
*******************************************************************************
The machines.LINUX file has the following hostnames:
octopus100.ubi.pt
compute-0-0.local
compute-0-1.local
compute-0-2.local
compute-0-3.local
compute-0-4.local
compute-0-5.local
compute-0-6.local
I then ran the installation and installed all the modules in the /export/apps/intel directory, which can be accessed by all nodes, as suggested by the cluster administrator. After completing the installation, I added the environment scripts psxevars.sh and mpivars.sh to my bash script, as advised in the Getting Started manual. I then prepared a hostfile listing all the nodes of the cluster for running in the MPI environment and verified shell connectivity by running .sshconnectivity from the installation directory. It worked as before and detected all nodes successfully.
I wanted to check the cluster configuration, so I compiled and executed the test.c program from the mpi/test directory of the installation. It compiled fine, but when I executed myprog it returned the error ".../mpi/intel64/bin/pmi_proxy: No such file or directory", as follows:
[aditya@octopus100 Desktop]$ mpiicc -o myprog test.c
[aditya@octopus100 Desktop]$ mpirun -n 2 -ppn 1 -f ./hostfile ./myprog
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
Later I consulted the troubleshooting manual, which suggested running a non-MPI command such as hostname. It returned the same error:
[aditya@octopus100 Desktop]$ mpirun -ppn 1 -n 2 -hosts compute-0-0.local, compute-0-1.local hostname
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
When I included the master node octopus100.ubi.pt, the command worked, but only for that node; the other nodes do not seem to be able to run MPI commands. I think it may be an environment problem, since the compute nodes are not able to perform MPI communication with the master node.
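To narrow this down, one check would be whether the pmi_proxy path from the error message is actually visible on each compute node. Below is a minimal sketch of that check, assuming passwordless ssh works as .sshconnectivity reported and that machines.LINUX lists one hostname per line. The names check_nodes and REMOTE_SH are made up for illustration and are not part of the Intel tools.

```shell
#!/bin/bash
# Path copied from the error message above.
PROXY=/export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy

# check_nodes HOSTFILE
# For every host listed in HOSTFILE, report whether PROXY exists there
# as an executable file.
# REMOTE_SH is a hypothetical override for the remote shell (default: ssh).
check_nodes() {
    local hostfile=$1 host
    while read -r host; do
        # Skip blank lines in the host file.
        [ -z "$host" ] && continue
        # stdin is redirected so the remote shell does not swallow
        # the rest of the host list.
        if ${REMOTE_SH:-ssh} "$host" test -x "$PROXY" < /dev/null; then
            echo "$host: found"
        else
            echo "$host: MISSING"
        fi
    done < "$hostfile"
}
```

After sourcing the snippet, I would call it as "check_nodes machines.LINUX". Any host reported as MISSING cannot see the installation at that path, which would point to /export/apps/intel not being mounted there under the same path rather than to an MPI communication problem.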
Please help me resolve this issue so that I can perform some simulations on the cluster.
Thanks,
Aditya