Dear all,
I run the program with the following command:
mpiexec -wdir z:\directional -mapall -hosts 10 n01 5 n02 5 n03 5 n04 5 n05 5 n06 5 n07 5 n08 5 n09 5 n10 5 test
The cluster has 10 nodes, each with 24 logical cores (2 x Intel(R) Xeon(R) CPU X5675). The program 'test' has OpenMP-based parallel computation in some parts, but a considerable part of it is not parallelized. The problem is that each 'test' process only uses 4 cores in the parallel part, so total CPU usage is only about 80%. I noticed that when I set I_MPI_PIN_DOMAIN=omp, every 'test' process uses all 24 cores. I have tested the program 'test' on one node with
mpiexec -wdir z:\directional -mapall -n 5 test
The program 'test' then behaves as I want (total CPU usage reaches 100% in the parallel part).
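For reference, the kind of launch I am ultimately aiming for would look something like the line below. This is only a sketch: the -genv option for passing environment variables to all ranks and the value OMP_NUM_THREADS=4 (roughly 5 ranks x 4-5 threads on 24 cores per node) are my assumptions, not something I have working yet:
mpiexec -wdir z:\directional -mapall -genv I_MPI_PIN_DOMAIN omp -genv OMP_NUM_THREADS 4 -hosts 10 n01 5 n02 5 n03 5 n04 5 n05 5 n06 5 n07 5 n08 5 n09 5 n10 5 test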
Now the problem is that the first command fails after I set I_MPI_PIN_DOMAIN=omp:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(658)................:
MPID_Init(195).......................: channel initialization failed
MPIDI_CH3_Init(104)..................:
MPID_nem_tcp_post_init(345)..........:
MPID_nem_newtcp_module_connpoll(3102):
gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.
What should I do to make the program use 100% of the CPU on every node?
Thanks,
Zhanghong Tang