Channel: Clusters and HPC Technology

Problem runing Intel MPI w/ IB

The same code, submitted to the queue (SGE) on our cluster, crashes right away some of the time (roughly 25% of runs) with the following error message:

libmpifort.so.12   00002AC615FAD9BC  Unknown               Unknown  Unknown
magic.exe          00000000004D4B01  step_time_mod_mp_         335  m_step_time.F90
magic.exe          00000000004FA8EA  MAIN__                    301  magic.F90
magic.exe          00000000004042CE  Unknown               Unknown  Unknown
libc.so.6          0000003C23C1D994  Unknown               Unknown  Unknown
magic.exe          00000000004041E9  Unknown               Unknown  Unknown
[mpiexec@compute-8-21.local] control_cb (../../pm/pmiserv/pmiserv_cb.c:764): assert (!closed) failed
[mpiexec@compute-8-21.local] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@compute-8-21.local] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:480): error waiting for event
[mpiexec@compute-8-21.local] main (../../ui/mpich/mpiexec.c:945): process manager error waiting for completion

Line 335 is a 'call mpi_barrier()', hence the libmpifort.so.12 frame, I presume.

Since we use InfiniBand (I_MPI_FABRICS="shm:ofa"), I checked that the IB fabric is working with the exact same host list, using a trivial ring-passing test program (in C and in F90). The ring-passing program completes fine, every time. Any clue how to investigate this?
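One way to narrow this down might be to rerun the failing job with Intel MPI's debug output turned up, and to try a fabric fallback to see whether the shm:ofa path itself is implicated. A minimal sketch follows; the rank count and the hosts.txt file are placeholders to adapt to the actual SGE job:

```shell
# Sketch only: turn up Intel MPI diagnostics. Level 5 prints fabric
# selection and process pinning at startup.
export I_MPI_DEBUG=5
# Also trace the Hydra process manager, since mpiexec is what asserts
# (!closed) in pmiserv_cb.c.
export I_MPI_HYDRA_DEBUG=1

# Reproduce on the same host list as the failing job
# (96 ranks and hosts.txt are hypothetical placeholders).
mpirun -n 96 -f hosts.txt ./magic.exe

# If the intermittent crash disappears over TCP, the OFA path is suspect.
export I_MPI_FABRICS=shm:tcp
mpirun -n 96 -f hosts.txt ./magic.exe
```

Since the failure is intermittent, each configuration would need to be run enough times (say 10-20) to distinguish "fixed" from "got lucky".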

The 'magic.exe' program (a large third-party scientific simulation code) produces the following warning(s), although it continues running when it starts OK; this could be unrelated.

[95] ERROR - handle_read_individual(): Get one packet, but need to be packetized  10628, 1 4604, 12280
[95] ERROR - handle_read_individual(): Get one packet, but need to be packetized  10628, 1 4604, 12280

Any help appreciated.

Sylvain,

BTW:

% ldd magic.exe
        linux-vdso.so.1 =>  (0x00007ffff37ef000)
        libmpifort.so.12 => /software/intel_2015/impi/5.0.1.035/intel64/lib/libmpifort.so.12 (0x00002ab829b40000)
        libmpi.so.12 => /software/intel_2015/impi/5.0.1.035/intel64/lib/debug/libmpi.so.12 (0x00002ab829dcd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003e4de00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003e4ea00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e4e200000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003e4da00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003e4d600000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003e5c800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003e4d200000)
and mpirun is aliased to /software/intel_2015/impi/5.0.1.035/bin64/mpirun,
so it should not be a problem of mixing MPI implementations (we support
Intel, PGI and GNU toolchains).

 

