Invalid communicator issue with PMPI_Allreduce


Operating system and version: CentOS Linux release 7.5.1804
Intel MPI version: 2019.5.281
Compiler and version: 19.0.5.281
Fabric: Mellanox Technologies MT27500
Libfabric version: 1.7.2

Would anyone be able to help me with an "invalid communicator" error I've been getting with Intel MPI and the Intel compilers? The error occurs in one subroutine of a large code and is not present when building with OpenMPI and either the GNU or Intel compilers.

I receive the error when I use MPI_ALLREDUCE in this subroutine, but if I replace it with an MPI_REDUCE followed by an MPI_BCAST the code runs without problems. There are many other MPI_ALLREDUCE calls in other subroutines that work fine. The snippet that works:

     CALL MPI_REDUCE(EX,FEX,NATOMX,MPI_REAL8,MPI_SUM,0, &
       COMM_CHARMM,IERROR)
     CALL MPI_REDUCE(EY,FEY,NATOMX,MPI_REAL8,MPI_SUM,0, &
       COMM_CHARMM,IERROR)
     CALL MPI_REDUCE(EZ,FEZ,NATOMX,MPI_REAL8,MPI_SUM,0, &
       COMM_CHARMM,IERROR)
     CALL MPI_BCAST(FEX,NATOMX,MPI_REAL8,0,COMM_CHARMM,IERROR)
     CALL MPI_BCAST(FEY,NATOMX,MPI_REAL8,0,COMM_CHARMM,IERROR)
     CALL MPI_BCAST(FEZ,NATOMX,MPI_REAL8,0,COMM_CHARMM,IERROR)

The snippet that causes the error:

     CALL MPI_ALLREDUCE(EX,FEX,NATOMX,MPI_REAL8,MPI_SUM,0, &
       COMM_CHARMM,IERROR)
     CALL MPI_ALLREDUCE(EY,FEY,NATOMX,MPI_REAL8,MPI_SUM,0, &
       COMM_CHARMM,IERROR)
     CALL MPI_ALLREDUCE(EZ,FEZ,NATOMX,MPI_REAL8,MPI_SUM,0, &
       COMM_CHARMM,IERROR)
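
For reference, the MPI standard's Fortran binding of MPI_ALLREDUCE takes no root rank (only MPI_REDUCE does): MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR). A minimal sketch of a call written against that binding, reusing the variables from the snippet above, would be:

     CALL MPI_ALLREDUCE(EX,FEX,NATOMX,MPI_REAL8,MPI_SUM, &
       COMM_CHARMM,IERROR)

(and likewise for the EY/FEY and EZ/FEZ calls).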

After setting I_MPI_DEBUG=6 and I_MPI_HYDRA_DEBUG=on, the error message is:

Abort(1007228933) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Allreduce: Invalid communicator, error stack:
PMPI_Allreduce(434): MPI_Allreduce(sbuf=0x2b5015d1b6c0, rbuf=0x2b5004ff8740, count=1536, datatype=dtype=0x4c000829, op=MPI_SUM, comm=comm=0x0) failed
PMPI_Allreduce(355): Invalid communicator

The problem persists even when running on a single core with mpirun. The initial MPI debug output in that case is:

$ mpirun -ppn 1 -n 1 ../build/cmake/charmm-bug -i c45test/dcm-ti.inp

[mpiexec@pc-beethoven.cluster] Launch arguments: /opt/intel-2019/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host pc-beethoven.cluster --upstream-port 36326 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel-2019/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /opt/intel-2019/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=appnum appnum=0
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=my_kvsname kvsname=kvs_24913_0
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=barrier_out
[0] MPI startup(): libfabric version: 1.7.2a-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=my_kvsname kvsname=kvs_24913_0
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=put kvsname=kvs_24913_0 key=bc-0 value=mpi#0200ADFEC0A864030000000000000000$
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=barrier_out
[proxy:0:0@pc-beethoven.cluster] pmi cmd from fd 6: cmd=get kvsname=kvs_24913_0 key=bc-0
[proxy:0:0@pc-beethoven.cluster] PMI response: cmd=get_result rc=0 msg=success value=mpi#0200ADFEC0A864030000000000000000$
[0] MPI startup(): Rank    Pid      Node name             Pin cpu
[0] MPI startup(): 0       24917    pc-beethoven.cluster  {0,1,2,3,4,5,6,7}
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/opt/intel-2019/compilers_and_libraries_2019.5.281/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_DEBUG=on
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=6

Note that I also tried using FI_PROVIDER=sockets, with the same result (see the sketch below for how the provider was selected). Any ideas?
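
For reference, the sockets provider was selected by exporting the libfabric provider variable before launching, roughly as follows (the exact invocation used for that test is not shown above, so treat this as an assumption):

$ export FI_PROVIDER=sockets
$ mpirun -ppn 1 -n 1 ../build/cmake/charmm-bug -i c45test/dcm-ti.inp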

