I was running a MPI program on Skylake nodes. The program is able to finish on only one node (np=2, ph=2) and reports no error. However, if I run it using two Skylake nodes (np=2, ph=1), I would get the following error:
rank = 1, revents = 8, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
Weird thing is, all my colleagues using csh can finish the run without reporting this failed assertion, and who are using bash (including me) always get this assertion issue failed at the same line (2988). What could be the potential causes for this type of error?