The same code, submitted to the queue (SGE) on our cluster, crashes right away in some of the runs (roughly 25% of the cases?) with the following error message:
libmpifort.so.12   00002AC615FAD9BC  Unknown            Unknown  Unknown
magic.exe          00000000004D4B01  step_time_mod_mp_  335      m_step_time.F90
magic.exe          00000000004FA8EA  MAIN__             301      magic.F90
magic.exe          00000000004042CE  Unknown            Unknown  Unknown
libc.so.6          0000003C23C1D994  Unknown            Unknown  Unknown
magic.exe          00000000004041E9  Unknown            Unknown  Unknown
[mpiexec@compute-8-21.local] control_cb (../../pm/pmiserv/pmiserv_cb.c:764): assert (!closed) failed
[mpiexec@compute-8-21.local] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@compute-8-21.local] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:480): error waiting for event
[mpiexec@compute-8-21.local] main (../../ui/mpich/mpiexec.c:945): process manager error waiting for completion
Line 335 is a 'call mpi_barrier()', hence the libmpifort.so.12 frame, I presume.
Since we use InfiniBand (I_MPI_FABRICS="shm:ofa"), I checked that the IB fabric works with the exact same host list, using a trivial ring-passing test program (in C and in F90; essentially the sketch below). The ring-passing programs complete fine, every time. Any clue how to investigate this?
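For reference, the ring test is along these lines (a minimal sketch, not the exact source we run): each rank passes a token to rank+1 and receives one from rank-1.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token_out, token_in;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    token_out = rank;
    /* Combined send/receive around the ring; avoids deadlock for any ring size. */
    MPI_Sendrecv(&token_out, 1, MPI_INT, (rank + 1) % size, 0,
                 &token_in,  1, MPI_INT, (rank - 1 + size) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received token %d from rank %d\n",
           rank, token_in, (rank - 1 + size) % size);

    MPI_Finalize();
    return 0;
}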
When it does start OK, the 'magic.exe' program (a third-party, large scientific simulation code) produces the following warning(s), although it continues running - this could be unrelated.
[95] ERROR - handle_read_individual(): Get one packet, but need to be packetized 10628, 1 4604, 12280
[95] ERROR - handle_read_individual(): Get one packet, but need to be packetized 10628, 1 4604, 12280
Any help appreciated.
Sylvain,
BTW:
% ldd magic.exe
        linux-vdso.so.1 =>  (0x00007ffff37ef000)
        libmpifort.so.12 => /software/intel_2015/impi/5.0.1.035/intel64/lib/libmpifort.so.12 (0x00002ab829b40000)
        libmpi.so.12 => /software/intel_2015/impi/5.0.1.035/intel64/lib/debug/libmpi.so.12 (0x00002ab829dcd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003e4de00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003e4ea00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e4e200000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003e4da00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003e4d600000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003e5c800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003e4d200000)

and mpirun is aliased to /software/intel_2015/impi/5.0.1.035/bin64/mpirun, so it should not be a problem of mixing MPI implementations (we support Intel, PGI and GNU toolchains).
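For completeness, a minimal per-rank check of which MPI library is actually loaded at run time could look something like the sketch below (assumes an MPI-3 capable library; MPI_Get_library_version is available in Intel MPI 5.x).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Print the MPI library version string and host name from every rank,
       to confirm all nodes load the same Intel MPI at run time. */
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    char host[MPI_MAX_PROCESSOR_NAME];
    int rank, vlen, hlen;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_library_version(version, &vlen);
    MPI_Get_processor_name(host, &hlen);

    printf("rank %d on %s: %s\n", rank, host, version);

    MPI_Finalize();
    return 0;
}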