(Somehow my previous post, 700216, was initially in a draft state, then got published, but never appeared on the mailing list.)
Hi,
the I_MPI_PIN_* variables can be used to set pretty much any CPU mask for the MPI ranks. Unfortunately, the Intel MPI library doesn't set the mask correctly for processes that are dynamically spawned.
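For reference, the kind of control I mean can also be expressed through the pinning environment variables instead of the -binding option. A minimal sketch (the variable names are from the Intel MPI reference; the explicit core list is made up for a node with two 8-core sockets):

```shell
# Sketch: explicit pinning via Intel MPI environment variables.
# Core numbers below are illustrative for a 2x8-core node.
export I_MPI_PIN=yes
export I_MPI_PIN_DOMAIN=core            # one core per rank
export I_MPI_PIN_ORDER=scatter          # round-robin over the sockets
# ...or spell the per-rank cores out explicitly:
export I_MPI_PIN_PROCESSOR_LIST=0,8,1   # rank0->core0, rank1->core8, rank2->core1
echo "$I_MPI_PIN $I_MPI_PIN_DOMAIN $I_MPI_PIN_ORDER $I_MPI_PIN_PROCESSOR_LIST"
# mpirun -n 2 -hosts int1,int2 -ppn 3 ./mpi.impi    # not run here
```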
Here's an example to show the problem:
program mpispawn
  use mpi
  implicit none
  integer ierr,errcodes(4),pcomm,mpisize,dumm,rank,wrank
  character(1000) cmd
  logical master

  call MPI_Init(ierr)
  call get_command_argument(0,cmd)
  print*,'cmd=',trim(cmd)
  call MPI_Comm_get_parent(pcomm,ierr)
  if (pcomm.eq.MPI_COMM_NULL) then
     ! no parent: this is an initial process started by mpirun
     print*,'I am the master. Clone myself!'
     master=.true.
     call MPI_Comm_spawn(cmd,MPI_ARGV_NULL,4,MPI_INFO_NULL,0, &
                         MPI_COMM_WORLD,pcomm,errcodes,ierr)
     call MPI_Comm_size(pcomm,mpisize,ierr)
     print*,'Processes in intercommunicator:',mpisize
     dumm=88
     ! on an intercommunicator, only one parent passes MPI_ROOT;
     ! the other parents must pass MPI_PROC_NULL
     call MPI_Comm_rank(MPI_COMM_WORLD,wrank,ierr)
     if (wrank.eq.0) then
        call MPI_Bcast(dumm,1,MPI_INTEGER,MPI_ROOT,pcomm,ierr)
     else
        call MPI_Bcast(dumm,1,MPI_INTEGER,MPI_PROC_NULL,pcomm,ierr)
     endif
  else
     print*,'I am a clone. Use me'
     master=.false.
     call MPI_Bcast(dumm,1,MPI_INTEGER,0,pcomm,ierr)
  endif
  call MPI_Comm_rank(pcomm,rank,ierr)
  print*,'rank,master,dumm=',rank,master,dumm
  call sleep(300)   ! keep the processes alive to inspect their binding
  call MPI_Barrier(pcomm,ierr)
  call MPI_Finalize(ierr)
end
I run this example on 2 nodes, each with two 8-core CPUs. I request core binding and domains with scattered ordering and 3 processes per node (so ranks are bound round-robin to the sockets). mpirun starts 2 MPI processes, and these spawn 4 further MPI processes:
[donners@int1 mpispawn]$ I_MPI_DEBUG=4 mpirun -n 2 -hosts "int1,int2" -ppn 3 -binding "pin=yes;cell=core;domain=1;order=scatter" ./mpi.impi
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name                    Pin cpu
[0] MPI startup(): 0       31036    int1.cartesius.surfsara.nl   {0}
[0] MPI startup(): 1       31037    int1.cartesius.surfsara.nl   {8}
cmd=./mpi.impi
 I am the master. Clone myself!
cmd=./mpi.impi
 I am the master. Clone myself!
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): reinitialization: shm and dapl data transfer modes
[1] MPI startup(): reinitialization: shm and dapl data transfer modes
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[3] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[3] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
 Processes in intercommunicator: 2
 rank,master,dumm= 1 T 88
 Processes in intercommunicator: 2
 rank,master,dumm= 0 T 88
[0] MPI startup(): Rank    Pid      Node name                    Pin cpu
[0] MPI startup(): 0       31045    int1.cartesius.surfsara.nl   {0}
[0] MPI startup(): 1       24519    int2.cartesius.surfsara.nl   {0}
[0] MPI startup(): 2       24520    int2.cartesius.surfsara.nl   {8}
[0] MPI startup(): 3       24521    int2.cartesius.surfsara.nl   {1}
cmd=./mpi.impi
 I am a clone. Use me
 rank,master,dumm= 0 F 88
cmd=./mpi.impi
 I am a clone. Use me
cmd=./mpi.impi
 I am a clone. Use me
cmd=./mpi.impi
 I am a clone. Use me
 rank,master,dumm= 1 F 88
 rank,master,dumm= 2 F 88
 rank,master,dumm= 3 F 88
The 2 initial processes are bound correctly: round-robin to the first core of each socket on the first node (cores 0 and 8). However, the first dynamically spawned rank is also bound to core 0 of the first node, where it competes with initial rank 0 for that core; it should apparently have been bound to the next free core instead. Note that the dynamically spawned processes do get distributed correctly across the nodes, and the binding on the second node is correct.
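As a cross-check independent of the I_MPI_DEBUG output, the affinity mask each process actually got can be read back from the Linux kernel. A minimal sketch (run it inside a wrapper script around ./mpi.impi, or against the PID of a running rank):

```shell
# Print the CPU affinity of the current process as seen by the kernel.
# Replace 'self' with a rank's PID to inspect a running MPI process.
grep Cpus_allowed_list /proc/self/status
```

On the first node above, this should show disjoint single-core lists for the two initial ranks, but the same core (0) for initial rank 0 and the first spawned rank.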
What can be done to bind all dynamically spawned processes correctly?
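In the meantime, one workaround I can think of is to spawn a small wrapper script instead of the binary, and let each child re-pin itself with taskset. This is only a sketch under two assumptions: that the process manager exports the rank to each process as PMI_RANK (Intel MPI's Hydra typically does), and a made-up scatter mapping for a node with two 8-core sockets:

```shell
#!/bin/sh
# Hypothetical wrapper passed to MPI_Comm_spawn in place of ./mpi.impi.
# Assumes PMI_RANK holds this process's rank (exported by the process manager).
RANK=${PMI_RANK:-0}
SOCKET=$(( RANK % 2 ))          # alternate between the two sockets
SLOT=$(( RANK / 2 ))            # next free core on that socket
CORE=$(( SOCKET * 8 + SLOT ))   # scatter mapping for a 2x8-core node (made up)
echo "would run: taskset -c $CORE ./mpi.impi"
# exec taskset -c "$CORE" ./mpi.impi "$@"   # the actual re-pinning step
```

This only restores a fixed mapping per node, so it is no substitute for the library doing it right, but it keeps spawned ranks off each other's cores.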