Channel: Clusters and HPC Technology

Error message from dapl_module_poll.c while running MPI job


Dear experts,

I have experienced an error while running a parallel code compiled with Intel MPI. To start my jobs I use an environment variable, $DO_PARALLEL, with the following content:

mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 -env I_MPI_DEBUG 5

Our cluster uses PBS to submit jobs.

The behavior is rather unpredictable: sometimes my code runs without problems, while at other times it fails with the following error:

OS: Scientific Linux SL release 5.5 (Boron)
[0:n010106] unexpected DAPL connection event 0x4008 from 34
Assertion failed in file ../../dapl_module_poll.c at line 4287: 0
internal ABORT - process 0
[9:n010404] unexpected disconnect completion event from [0:n010106]
[11:n010404] unexpected disconnect completion event from [0:n010106]
[22:n010312] unexpected disconnect completion event from [0:n010106]
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
[7:n010106] unexpected disconnect completion event from [15:n010404]
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
internal ABORT - process 7

I guess that this is a communication problem.
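
To make it easier to see which ranks and hosts are involved, a log like the one above can be summarized with a short script. This is only a minimal sketch in Python: it assumes the message formats exactly as pasted in this post (other Intel MPI versions may word them differently), and the names `summarize_dapl_log` and `SAMPLE` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Patterns follow the message wording pasted in this post (an assumption;
# they are not taken from any Intel MPI documentation).
EVENT_RE = re.compile(
    r"^\[(\d+):([^\]]+)\] unexpected DAPL connection event (0x[0-9a-fA-F]+) from (\d+)"
)
DISCONNECT_RE = re.compile(
    r"^\[(\d+):([^\]]+)\] unexpected disconnect completion event from \[(\d+):([^\]]+)\]"
)
ABORT_RE = re.compile(r"^internal ABORT - process (\d+)")

# A few lines copied from the log in this post.
SAMPLE = """\
[0:n010106] unexpected DAPL connection event 0x4008 from 34
Assertion failed in file ../../dapl_module_poll.c at line 4287: 0
internal ABORT - process 0
[9:n010404] unexpected disconnect completion event from [0:n010106]
[11:n010404] unexpected disconnect completion event from [0:n010106]
[22:n010312] unexpected disconnect completion event from [0:n010106]
[7:n010106] unexpected disconnect completion event from [15:n010404]
internal ABORT - process 7
"""


def summarize_dapl_log(text):
    """Collect aborting ranks, connection events and lost peers from a log."""
    events = []             # (rank, host, event code, source id)
    lost_peers = Counter()  # (rank, host) of the peer whose connection dropped
    reporters = []          # (rank, host) of the process that printed the message
    aborted_ranks = []
    for raw in text.splitlines():
        line = raw.strip()
        m = EVENT_RE.match(line)
        if m:
            events.append((int(m.group(1)), m.group(2), m.group(3), int(m.group(4))))
            continue
        m = DISCONNECT_RE.match(line)
        if m:
            reporters.append((int(m.group(1)), m.group(2)))
            lost_peers[(int(m.group(3)), m.group(4))] += 1
            continue
        m = ABORT_RE.match(line)
        if m:
            aborted_ranks.append(int(m.group(1)))
    return {
        "events": events,
        "lost_peers": lost_peers,
        "reporters": reporters,
        "aborted_ranks": aborted_ranks,
    }


if __name__ == "__main__":
    summary = summarize_dapl_log(SAMPLE)
    print("aborted ranks:", summary["aborted_ranks"])
    for (rank, host), count in summary["lost_peers"].most_common():
        print(f"rank {rank} on {host}: reported lost by {count} peer(s)")
```

Lines that match none of the three patterns (for example the assertion messages) are simply skipped, so the full job output can be passed in unchanged.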

Each node is equipped with 8 Quad-Core Intel® Xeon® Processor 5400 Series processors and has 16 GB of memory.

I did a little research on the internet and came to the conclusion that this might be a communication issue. The errors started appearing when I began communicating large arrays to the slaves. I would appreciate any ideas and/or explanations of the reason behind this rather strange behavior.

Thanks,

Alex

