Hi there,
I have a system with 6 compute nodes. The /opt folder is shared over NFS, and Intel Parallel Studio Cluster Edition is installed on the NFS server.
I am using Slurm as the workload manager. When I run a VASP job on 1 node there is no problem, but when I run the job on 2 or more nodes I get the following errors:
rank = 28, revents = 29, state = 1
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0
I tested SSH between the compute nodes with sshconnectivity.exp /nodefile.
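The SSH test only exercises the SSH port, while the failed assertion comes from the TCP netmod (socksm.c), so plain TCP connectivity between the nodes on other ports may also be worth checking. Below is a minimal Python sketch of such a check. The helper name can_connect, the node name node02 and the port number are all hypothetical and are not part of Intel MPI, Slurm or VASP.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False


# Example usage (hypothetical node name and port; something must be
# listening on that port on the remote node for the result to be True):
#   print(can_connect("node02", 5000))
```

A False result only means the connection could not be opened. It does not say on its own whether a firewall, a missing listener or name resolution is responsible.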
User information is shared via an LDAP server, which runs on the head node.
I couldn't find a working solution online. Has anyone seen this error before?
Thanks.