Dear all,
I'd like to run separate MPI processes in a server/client setup, as explained in:
https://software.intel.com/en-us/articles/using-the-intel-mpi-library-in-a-serverclient-setup
I have attached the two programs that I used as test:
- accept.c opens a port and calls MPI_Comm_accept
- connect.c calls MPI_Comm_connect and expects the port name as an argument
- once the connection is set up, MPI_Allreduce is used to sum some integers over the intercommunicator and the merged intracommunicator (a minimal sketch of both programs is shown below)
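In essence, the pair looks like the following minimal sketch (the attached files may differ in details such as error checking and the exact messages; this is only meant to make the call sequence explicit):

/* accept.c (sketch): open a port, accept one client, then sum an integer
 * over the intercommunicator and over the merged intracommunicator. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm, intracomm;
    int my_value = 8, sum, rsize;

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);   /* system-chosen port name */
    printf("mpiport=%s\n", port);         /* the job script greps for this line */
    fflush(stdout);

    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    MPI_Comm_remote_size(intercomm, &rsize);
    printf("Size of intercommunicator: %d\n", rsize);

    /* on an intercommunicator, each group receives the sum of the other group */
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intercomm);
    printf("intercomm, my_value=%d SUM=%d\n", my_value, sum);

    /* merge both sides into one intracommunicator and reduce again */
    MPI_Intercomm_merge(intercomm, 0, &intracomm);
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intracomm);
    printf("intracomm, my_value=%d SUM=%d\n", my_value, sum);

    MPI_Comm_free(&intracomm);
    MPI_Comm_disconnect(&intercomm);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

/* connect.c (sketch): connect to the port name given on the command line. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm, intracomm;
    int my_value = 7, sum, rsize;

    MPI_Init(&argc, &argv);
    printf("Port name entered: %s\n", argv[1]);

    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    MPI_Comm_remote_size(intercomm, &rsize);
    printf("Size of intercommunicator: %d\n", rsize);

    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intercomm);
    printf("intercomm, my_value=%d SUM=%d\n", my_value, sum);

    MPI_Intercomm_merge(intercomm, 1, &intracomm);
    MPI_Allreduce(&my_value, &sum, 1, MPI_INT, MPI_SUM, intracomm);
    printf("intracomm, my_value=%d SUM=%d\n", my_value, sum);

    MPI_Comm_free(&intracomm);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}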
Everything works fine when I start these programs interactively with mpirun:
[donners@int2 openport]$ mpirun -n 1 ./accept_c &
[3] 9575
[donners@int2 openport]$ ./accept_c: MPI_Open_port..
./accept_c: mpiport=tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./accept_c: MPI_Comm_Accept..
[3]+ Stopped    mpirun -n 1 ./accept_c
[donners@int2 openport]$ mpirun -n 1 ./connect_c 'tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$'
./connect_c: Port name entered: tag#0$rdma_port0#9585$rdma_host0#0A000010000071F1FE800000000000000002C9030019455100000004$arch_code#6$
./connect_c: MPI_Comm_connect..
./connect_c: Size of intercommunicator: 1
./connect_c: intercomm, MPI_Allreduce..
./accept_c: Size of intercommunicator: 1
./accept_c: intercomm, MPI_Allreduce..
./connect_c: intercomm, my_value=7 SUM=8
./accept_c: intercomm, my_value=8 SUM=7
./accept_c: intracomm, MPI_Allreduce..
./accept_c: intracomm, my_value=8 SUM=15 Done
./accept_c: Done
./connect_c: intracomm, MPI_Allreduce..
./connect_c: intracomm, my_value=7 SUM=15 Done
However, it fails when started by SLURM. The job script looks like:
#!/bin/bash
#SBATCH -n 2
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
tmp=$(mktemp)
srun -l -n 1 ./accept_c 2>&1 | tee $tmp &
until [ "$port" != "" ]; do
  port=$(cat $tmp | fgrep mpiport | cut -d= -f2-)
  echo "Found port: $port"
  sleep 1
done
srun -l -n 1 ./connect_c "$port" <<EOF
$port
EOF
The output is:
Found port:
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Open_port..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: mpiport=tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: MPI_Comm_Accept..
Found port: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
output connect_c: /scratch/nodespecific/srv4/donners.2300217/tmp.rQY9kHs8HS
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Port name entered: tag#0$rdma_port0#19635$rdma_host0#0A00000700003F67FE800000000000000002C9030019453100000004$arch_code#0$
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: MPI_Comm_connect..
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./accept_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.0 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: Size of intercommunicator: 1
0: /nfs/home1/donners/Tests/mpi/openport/./connect_c: intercomm, MPI_Allreduce..
0: Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 206: ptr && ptr == (char*) MPIDI_Process.my_pg->id
0: internal ABORT - process 0
0: In: PMI_Abort(1, internal ABORT - process 0)
0: slurmstepd: *** STEP 2300217.1 ON srv4 CANCELLED AT 2016-07-28T13:41:40 ***
The server and client do connect, but the run fails as soon as communication starts. This looks like a bug in the MPI library.
Could you let me know whether this is indeed a bug, or whether this use case is simply not supported by Intel MPI?
With regards,
John