Dear experts,
I am getting an error while running a parallel code compiled with Intel MPI. To start my jobs I use an environment variable, $DO_PARALLEL, with the following content:
mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 -env I_MPI_DEBUG 5
Our cluster uses PBS to submit jobs.
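For readers unfamiliar with this setup, here is a minimal sketch of how a launcher variable like $DO_PARALLEL is typically placed in front of the executable. The name ./my_app is a hypothetical placeholder, not the actual program, and the snippet only prints the resulting command line rather than running it.

```shell
# Launcher prefix exactly as quoted in the post
DO_PARALLEL="mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 -env I_MPI_DEBUG 5"

# ./my_app is a hypothetical placeholder for the real executable.
# Print the full command line that would be launched.
echo "$DO_PARALLEL ./my_app"
```

In a real job script the last line would presumably be `$DO_PARALLEL ./my_app` without the `echo`, so that the shell splits the variable into mpiexec and its arguments.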
The behavior is rather unpredictable: sometimes my code runs without problems, while at other times it fails with the following error:
OS: Scientific Linux SL release 5.5 (Boron)

[0:n010106] unexpected DAPL connection event 0x4008 from 34
Assertion failed in file ../../dapl_module_poll.c at line 4287: 0
internal ABORT - process 0
[9:n010404] unexpected disconnect completion event from [0:n010106]
[11:n010404] unexpected disconnect completion event from [0:n010106]
[22:n010312] unexpected disconnect completion event from [0:n010106]
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
[7:n010106] unexpected disconnect completion event from [15:n010404]
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
internal ABORT - process 7

I guess that this is a communication problem.
Each node has 8 quad-core Intel® Xeon® 5400 series processors and 16 GB of memory.
I did a little research online and concluded that this might be a communication issue. The errors started appearing when I began sending large arrays to the slaves. I would appreciate any ideas or explanations of what causes this rather strange behavior.
Thanks,
Alex