Hi All,
I will explain the current situation and the attached file.
We are currently debugging an MPI application launched through LSF because it sometimes fails to terminate. At the code level we suspect mpi_finalize; the hang occurs randomly, not every time, so we still need to narrow down the conditions under which it happens. I asked about a similar symptom on the MPI forum, but the outcome is unknown because the thread was moved to a ticket partway through.
Please check whether this is a similar symptom:
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...
- Strace results from the MPI execution hosts (the line I suspect is "│ + 01:17:43 read(7")
----------------------------------------------------------------------------------------
duru0403 has 24 procs as below:
* Name/State : pmi_proxy / State: S (sleeping)
PID/PPID : 141955 / 141954
Commandline : **************/apps/intel/18.4/impi/2018.4.274/intel64/bin/pmi_proxy --control-port duru0374:37775 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk lsf --launcher lsf --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1000390395 --usize -2 --proxy-id -1
CPU/MEMs_allowed : 0-95 / 0-3
[<ffffffff96e56e55>] poll_schedule_timeout+0x55/0xb0
[<ffffffff96e585dd>] do_sys_poll+0x48d/0x590
[<ffffffff96e587e4>] SyS_poll+0x74/0x110
[<ffffffff97374ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
Files :
Num of pipes: 26
Num of sockets: 16
Num of anon_inodes: 0
Strace :
+ /xshared/support/systrace/strace: Process 141955 attached
+ 01:17:43 restart_syscall(<... resuming interrupted poll ...>/xshared/support/systrace/strace: Process 141955 detached
+ <detached ...>
Num of subprocs : 23
│
├─Name/State : ensda / State: S (sleeping)
│ PID/PPID : 141959 / 141955
│ Commandline : **************
│ CPU/MEMs_allowed : 0 / 0-3
│ [<ffffffff972f5139>] unix_stream_read_generic+0x309/0x8e0
│ [<ffffffff972f5804>] unix_stream_recvmsg+0x54/0x70
│ [<ffffffff972186ec>] sock_aio_read.part.9+0x14c/0x170
│ [<ffffffff97218731>] sock_aio_read+0x21/0x30
│ [<ffffffff96e404d3>] do_sync_read+0x93/0xe0
│ [<ffffffff96e40fb5>] vfs_read+0x145/0x170
│ [<ffffffff96e41dcf>] SyS_read+0x7f/0xf0
│ [<ffffffff97374ddb>] system_call_fastpath+0x22/0x27
│ [<ffffffffffffffff>] 0xffffffffffffffff
│ Files :
│ - > /dev/infiniband/uverbs0
│ - > **************/log_proc00324.log
│ - /dev/infiniband/uverbs0
│ Num of pipes: 6
│ Num of sockets: 5
│ Num of anon_inodes: 6
│ Strace :
│ + /xshared/support/systrace/strace: Process 141959 attached
│ + 01:17:43 read(7, /xshared/support/systrace/strace: Process 141959 detached
│ + <detached ...>
│ Num of subprocs : 0
----------------------------------------------------------------------------------------
- Version information
Intel Compiler: 18.5.234
Intel MPI: 18.4.234
DAPL: ofa-v2-mlx5_0-1u
- MPI options I used
declare -x I_MPI_DAPL_UD="1"
declare -x I_MPI_FABRICS="dapl"
declare -x I_MPI_HYDRA_BOOTSTRAP="lsf"
declare -x I_MPI_PIN="1"
declare -x I_MPI_PIN_PROCESSOR_LIST="0-5,24-29"
declare -x I_MPI_ROOT="**************/apps/intel/18.4/compilers_and_libraries/linux/mpi"
- The code I used
After the call to MPI_FINALIZE there are five lines of code, consisting of if, close, and deallocate statements.
Can these cause the hang problem? (A rough checking sketch follows the code below.)
! last part of main_program
call fin_common_par
! (there is nothing after this)
end program
!!!!!!!!!!!!!!!!!
subroutine fin_common_par
  implicit none
  integer :: ierr
  call mpi_finalize(ierr)
  call fin_log
  if(allocated(ranks_per_node)) deallocate(ranks_per_node)
  if(allocated(stride_ranks)) deallocate(stride_ranks)
  return
end subroutine fin_common_par
!!!!!!!!!!!!!!!!!
subroutine fin_log
  implicit none
  if(logf_unit == closed_unit) return
  close(logf_funit)
  logf_unit = closed_unit
  return
end subroutine fin_log
!!!!!!!!!!!!!!!!!
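To narrow this down, one option might be to put markers around the finalize call. The following is only a rough sketch, not the real routine: the subroutine name, the rank caching, and the use of iso_fortran_env are my own assumptions. The idea is that if the "after mpi_finalize" marker never appears, the close/deallocate lines that follow cannot be what is hanging.

! Rough sketch (illustrative only): markers around mpi_finalize to see
! whether control ever returns from it.
subroutine fin_common_par_check          ! hypothetical name, not the real routine
  use mpi                                ! Intel MPI provides the Fortran 'mpi' module
  use, intrinsic :: iso_fortran_env, only: error_unit
  implicit none
  integer :: ierr, myrank

  call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)
  write(error_unit,*) 'rank', myrank, ': before mpi_finalize'
  flush(error_unit)

  call mpi_finalize(ierr)

  ! If this line is never printed, the hang is inside mpi_finalize itself,
  ! not in the later close/deallocate statements.
  write(error_unit,*) 'rank', myrank, ': after mpi_finalize, ierr =', ierr
  flush(error_unit)
end subroutine fin_common_par_check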
Additionally, how can I get the call stack of a process, as shown in this post?
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...
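What I have so far are only the kernel-side stacks shown above (they appear to come from /proc/<pid>/stack). My rough understanding is that a user-space backtrace could be captured by attaching a debugger to the hung rank, roughly as below; the PID is just the example process from the dump, and pstack may not be installed everywhere.

# Kernel-side stack, as already shown in the report above (may require root):
cat /proc/141959/stack

# User-space call stack by attaching gdb non-interactively to the hung rank:
gdb -p 141959 -batch -ex "thread apply all bt"

# pstack, if available, gives a similar result:
pstack 141959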
Thank you in advance.