I am running a program on Skylake nodes. When I run it on one node (np=2, ph=2), it completes successfully. However, when I run it on two nodes (np=2, ph=1), I get the following assertion failure:
rank = 1, revents = 8, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
Does anyone know the possible causes of this type of assertion failure? The weird thing is that all my colleagues who use csh can run the program without any error, while all colleagues who use bash (including me) always see the same failure at the same line (2988).
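
In case it helps with reading the log: the assertion tests the POLLERR bit of a poll() revents field, so revents can be decoded as a bitmask. Below is a minimal sketch of that decoding. It assumes the Linux flag values from <poll.h> (other platforms may differ), and decode_revents is a hypothetical helper name, not something from MPICH.

```python
# Flag values as defined in <poll.h> on Linux (an assumption; check your platform).
POLL_FLAGS = {
    0x001: "POLLIN",
    0x002: "POLLPRI",
    0x004: "POLLOUT",
    0x008: "POLLERR",
    0x010: "POLLHUP",
    0x020: "POLLNVAL",
}


def decode_revents(revents):
    """Return the names of the poll() flags set in a revents bitmask."""
    return [name for bit, name in POLL_FLAGS.items() if revents & bit]


if __name__ == "__main__":
    # revents value taken from the log line "rank = 1, revents = 8, state = 8"
    print(decode_revents(8))
```

Under these assumed values, revents = 8 has only the POLLERR bit set, which is consistent with the failed check (it_plfd->revents & POLLERR) == 0. The state = 8 field is not decoded here, since the log does not say what it encodes.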