
Intel Fortran 2019 + MPI cause an unexpected Segmentation Fault [Linux]


    Hello,

The following code example, compiled with `mpiifort`, produces a segmentation fault:
 

module test_intel_mpi_mod
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   type :: Container
      complex(kind=dp), allocatable :: arr(:, :, :)
   end type

contains
   subroutine test_intel_mpi()
      use mpi_f08, only: &
         MPI_Init_thread, &
         MPI_THREAD_SINGLE, &
         MPI_Finalize, &
         MPI_Comm_rank, &
         MPI_COMM_WORLD, &
         MPI_COMPLEX16, &
         MPI_Bcast

      integer :: provided
      integer :: rank
      type(Container) :: cont

      call MPI_Init_thread(MPI_THREAD_SINGLE, provided)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank)

      allocate(cont % arr(1, 1, 1))

      if (rank == 0) then
         cont % arr(1, 1, 1) = (1.0_dp, 2.0_dp)
      endif

! This works fine --->  call MPI_Bcast(cont % arr(1, 1, 1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD)
      call MPI_Bcast(cont % arr(:, :, 1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD)

      print *, rank, " after Bcast: ", cont % arr(1, 1, 1)
      call MPI_Finalize()
   end subroutine test_intel_mpi
end module test_intel_mpi_mod

program test_mpi
   use test_intel_mpi_mod

   call test_intel_mpi()
end program test_mpi

 

The code is compiled simply as follows: `mpiifort -o test_mpi test_mpi.f90` and executed as `mpirun -np N ./test_mpi` (N = 1, 2, ...).

The output for N = 2 is the following (`-g -traceback` was also added in this case):

           0  after Bcast:  (1.00000000000000,2.00000000000000)
           1  after Bcast:  (1.00000000000000,2.00000000000000)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
test_mpi           000000000041475A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AE5C8C0A5D0  Unknown               Unknown  Unknown
test_mpi           000000000040941D  Unknown               Unknown  Unknown
test_mpi           0000000000409D79  Unknown               Unknown  Unknown
test_mpi           00000000004044C0  test_intel_mpi_mo          44  test_mpi.f90
test_mpi           00000000004044E0  MAIN__                     50  test_mpi.f90
test_mpi           0000000000403BA2  Unknown               Unknown  Unknown
libc-2.17.so       00002AE5C913B3D5  __libc_start_main     Unknown  Unknown
test_mpi           0000000000403AA9  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
test_mpi           000000000041475A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AB3DEFB75D0  Unknown               Unknown  Unknown
test_mpi           000000000040941D  Unknown               Unknown  Unknown
test_mpi           0000000000409D79  Unknown               Unknown  Unknown
test_mpi           00000000004044C0  test_intel_mpi_mo          44  test_mpi.f90
test_mpi           00000000004044E0  MAIN__                     50  test_mpi.f90
test_mpi           0000000000403BA2  Unknown               Unknown  Unknown
libc-2.17.so       00002AB3DF4E83D5  __libc_start_main     Unknown  Unknown
test_mpi           0000000000403AA9  Unknown               Unknown  Unknown

 

The program crashes when it tries to exit the subroutine. The problem seems to be related to passing the array section, `cont % arr(:, :, 1)`, to MPI_Bcast, as opposed to passing a reference to the first element, `cont % arr(1, 1, 1)` (that version of the call is left commented out in the source above). At the same time, my understanding is that the MPI 3.x standard explicitly allows passing array sections, contiguous or not, through the mpi_f08 bindings (see e.g. https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node409.htm).
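Since the call that passes `cont % arr(1, 1, 1)` works, a possible workaround is to avoid handing the array section itself to MPI_Bcast. Below is a minimal, untested sketch of two such variants, meant as drop-in replacements for the failing call and assuming the same `Container` type and `dp` kind as in the reproducer (this only sidesteps the crash, it does not explain it):

! (a) Broadcast the whole allocatable component (contiguous) with an explicit count.
      call MPI_Bcast(cont % arr, size(cont % arr), MPI_COMPLEX16, 0, MPI_COMM_WORLD)

! (b) Copy the section into a contiguous temporary, broadcast it, and copy it back.
      block
         complex(kind=dp), allocatable :: tmp(:, :)
         tmp = cont % arr(:, :, 1)
         call MPI_Bcast(tmp, size(tmp), MPI_COMPLEX16, 0, MPI_COMM_WORLD)
         cont % arr(:, :, 1) = tmp
      end block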

Several details of the source are important for reproducing the segfault:

  • The crash happens only if MPI_Bcast is called -- commenting it out prevents the error
  • The subroutine must be in a module
  • The array must be at least 3-dimensional, allocatable, and contained in a derived-type object
  • Non-blocking MPI_Ibcast, as well as other collectives that involve a broadcast (e.g., MPI_Allreduce), gives the same result (see the sketch after this list)
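For reference, the non-blocking variant mentioned in the last item corresponds roughly to the following sketch, placed inside the same subroutine (MPI_Ibcast, MPI_Wait, MPI_Request and MPI_STATUS_IGNORE would also have to be added to the `use mpi_f08` only-list):

      use mpi_f08, only: MPI_Ibcast, MPI_Wait, MPI_Request, MPI_STATUS_IGNORE

      type(MPI_Request) :: req

      ! Non-blocking broadcast of the same array section, followed by a wait.
      call MPI_Ibcast(cont % arr(:, :, 1), 1, MPI_COMPLEX16, 0, MPI_COMM_WORLD, req)
      call MPI_Wait(req, MPI_STATUS_IGNORE)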

Compiler/library versions:

Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.217 Build 20200306

Intel MPI is from the same build: 2019.7.pre-intel-19.1.0.166-7

Output with I_MPI_DEBUG=6:

[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: psm2
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       278604   l49        {0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23}
[0] MPI startup(): 1       278605   l49        {8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31}
[0] MPI startup(): I_MPI_ROOT=....
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=6

OS: CentOS Linux release 7.6.1810 (Core)

Kernel: 3.10.0-957.10.1.el7.x86_64

