MPI program hangs in "MPI_Finalize"


Hi All,

Let me explain the current situation and the attached file.

We are debugging an MPI application launched through LSF because it sometimes fails to terminate. At the code level we suspect mpi_finalize; the hang occurs randomly rather than on every run, so we still need to narrow down the conditions under which it occurs. I asked about similar symptoms on the MPI forum, but the outcome is unknown because the thread was moved into a private ticket partway through.
Please check whether this is a similar symptom:
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...

- Strace results from the hosts running the MPI job (I suspect this blocked read: "│ + 01:17:43 read(7")

----------------------------------------------------------------------------------------
duru0403 has 24 procs as below:

* Name/State       : pmi_proxy / State:    S (sleeping)
  PID/PPID         : 141955 / 141954
  Commandline      : **************/apps/intel/18.4/impi/2018.4.274/intel64/bin/pmi_proxy --control-port duru0374:37775 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk lsf --launcher lsf --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1000390395 --usize -2 --proxy-id -1
  CPU/MEMs_allowed : 0-95 / 0-3
  [<ffffffff96e56e55>] poll_schedule_timeout+0x55/0xb0
  [<ffffffff96e585dd>] do_sys_poll+0x48d/0x590
  [<ffffffff96e587e4>] SyS_poll+0x74/0x110
  [<ffffffff97374ddb>] system_call_fastpath+0x22/0x27
  [<ffffffffffffffff>] 0xffffffffffffffff
  Files            :
     Num of pipes: 26
     Num of sockets: 16
     Num of anon_inodes: 0
  Strace           :
     + /xshared/support/systrace/strace: Process 141955 attached
     + 01:17:43 restart_syscall(<... resuming interrupted poll ...>/xshared/support/systrace/strace: Process 141955 detached
     +  <detached ...>
  Num of subprocs  : 23
  │
  ├─Name/State       : ensda / State:    S (sleeping)
  │ PID/PPID         : 141959 / 141955
  │ Commandline      : **************
  │ CPU/MEMs_allowed : 0 / 0-3
  │ [<ffffffff972f5139>] unix_stream_read_generic+0x309/0x8e0
  │ [<ffffffff972f5804>] unix_stream_recvmsg+0x54/0x70
  │ [<ffffffff972186ec>] sock_aio_read.part.9+0x14c/0x170
  │ [<ffffffff97218731>] sock_aio_read+0x21/0x30
  │ [<ffffffff96e404d3>] do_sync_read+0x93/0xe0
  │ [<ffffffff96e40fb5>] vfs_read+0x145/0x170
  │ [<ffffffff96e41dcf>] SyS_read+0x7f/0xf0
  │ [<ffffffff97374ddb>] system_call_fastpath+0x22/0x27
  │ [<ffffffffffffffff>] 0xffffffffffffffff
  │ Files            :
  │    -  > /dev/infiniband/uverbs0
  │    -  > **************/log_proc00324.log
  │    -   /dev/infiniband/uverbs0
  │    Num of pipes: 6
  │    Num of sockets: 5
  │    Num of anon_inodes: 6
  │ Strace           :
  │    + /xshared/support/systrace/strace: Process 141959 attached
  │    + 01:17:43 read(7, /xshared/support/systrace/strace: Process 141959 detached
  │    +  <detached ...>
  │ Num of subprocs  : 0
----------------------------------------------------------------------------------------

- Version Information
   Intel Compiler: 18.5.234
   Intel MPI: 18.4.234
   DAPL: ofa-v2-mlx5_0-1u

- MPI options I used

declare -x I_MPI_DAPL_UD="1"
declare -x I_MPI_FABRICS="dapl"
declare -x I_MPI_HYDRA_BOOTSTRAP="lsf"
declare -x I_MPI_PIN="1"
declare -x I_MPI_PIN_PROCESSOR_LIST="0-5,24-29"
declare -x I_MPI_ROOT="**************/apps/intel/18.4/compilers_and_libraries/linux/mpi"
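
For the next reproduction attempt I also plan to turn on Intel MPI's debug output. A sketch only: I_MPI_DEBUG and I_MPI_HYDRA_DEBUG are the documented Intel MPI debug knobs, and level 5 is just a starting point, not a recommendation from the vendor.

declare -x I_MPI_DEBUG="5"          # per-rank pinning/fabric info at startup
declare -x I_MPI_HYDRA_DEBUG="1"    # verbose output from the hydra launcher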

- And the code I used

After the MPI_FINALIZE call there are five lines of code: if checks, a close, and deallocate statements.
Can these cause the hang?

! last part of main_program

call fin_common_par

! (nothing else in between)

end program


!!!!!!!!!!!!!!!!!

subroutine fin_common_par
implicit none
integer :: ierr

call mpi_finalize(ierr)
call fin_log

! ranks_per_node and stride_ranks are module-level allocatable arrays
if(allocated(ranks_per_node)) deallocate(ranks_per_node)
if(allocated(stride_ranks))   deallocate(stride_ranks)

return
end subroutine fin_common_par

!!!!!!!!!!!!!!!!!

subroutine fin_log
implicit none

! logf_unit and closed_unit are module-level variables
if(logf_unit == closed_unit) return
close(logf_unit)
logf_unit = closed_unit

return
end subroutine fin_log

!!!!!!!!!!!!!!!!!
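
For reference, this is the shutdown order I am considering as a workaround. It is only a sketch: mpi_finalized and mpi_barrier are standard MPI calls, but the module wrapper is my assumption about where ranks_per_node, stride_ranks, logf_unit, and closed_unit actually live.

! sketch only: close local files and synchronize before mpi_finalize
module common_par_sketch
use mpi
implicit none
integer, allocatable :: ranks_per_node(:), stride_ranks(:)
integer, parameter   :: closed_unit = -1
integer              :: logf_unit = closed_unit
contains

subroutine fin_common_par
integer :: ierr
logical :: already_done

call fin_log   ! close the log before the MPI library shuts down

call mpi_finalized(already_done, ierr)
if(.not. already_done) then
   ! a barrier right before finalize helps distinguish a hang inside
   ! mpi_finalize from a rank that never reached it
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   call mpi_finalize(ierr)
end if

if(allocated(ranks_per_node)) deallocate(ranks_per_node)
if(allocated(stride_ranks))   deallocate(stride_ranks)
end subroutine fin_common_par

subroutine fin_log
if(logf_unit == closed_unit) return
close(logf_unit)
logf_unit = closed_unit
end subroutine fin_log
end module common_par_sketch

!!!!!!!!!!!!!!!!!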

Additionally, how can I get the call stack of a process, as in this post?
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...
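
So far I can only read the kernel-side stack from procfs (that is where the frames in the dump above come from); my guess for the user-space side is to attach gdb on the compute node. A sketch, using PID 141959 from the dump (reading /proc/<pid>/stack usually requires root):

cat /proc/141959/stack                           # kernel-side stack, as in the dump
gdb -p 141959 -batch -ex "thread apply all bt"   # user-space backtrace of all threads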

Thank you in advance.

