Hi,
I am trying to spawn processes across nodes using Intel MPI and mpi4py with the following code:
testmanager.py:
from mpi4py import MPI
import mpi4py
import sys
import argparse
import os
import distutils.spawn


def check_mpi():
    mpiexec_path, _ = os.path.split(distutils.spawn.find_executable("mpiexec"))
    for executable, path in mpi4py.get_config().items():
        if executable not in ['mpicc', 'mpicxx', 'mpif77', 'mpif90', 'mpifort']:
            continue
        if mpiexec_path not in path:
            raise ImportError("mpi4py may not be configured against the same version of 'mpiexec' that you are using. The 'mpiexec' path is {mpiexec_path} and mpi4py.get_config() returns:\n{mpi4py_config}\n".format(mpiexec_path=mpiexec_path, mpi4py_config=mpi4py.get_config()))
    # if 'Open MPI' not in MPI.get_vendor():
    #     raise ImportError("mpi4py must have been installed against Open MPI in order for StructOpt to function correctly.")
    vendor_number = ".".join([str(x) for x in MPI.get_vendor()[1]])
    if vendor_number not in mpiexec_path:
        print(MPI.get_vendor(), mpiexec_path)
        print(MPI.get_vendor(), mpiexec_path)
        #raise ImportError("The MPI version that mpi4py was compiled against does not match the version of 'mpiexec'. mpi4py's version number is {}, and mpiexec's path is {}".format(MPI.get_vendor(), mpiexec_path))


def main():
    # parser = argparse.ArgumentParser()
    # parser.add_argument('worker_count', type=int)
    worker_count = 20
    # args = parser.parse_args()
    check_mpi()
    mpi_info = MPI.Info.Create()
    mpi_info.Set("add-hostfile", "slurm.hosts")
    mpi_info.Set("host", "slurm.hosts")
    #print("about to spawn")
    comm = MPI.COMM_SELF.Spawn(sys.executable, args=['testworker.py'], maxprocs=worker_count, info=mpi_info).Merge()
    process_rank = comm.Get_rank()
    process_count = comm.Get_size()
    process_host = MPI.Get_processor_name()
    print('manager', process_rank, process_count, process_host)


main()
testworker.py:
from mpi4py import MPI


def main():
    print("Spawned")
    comm = MPI.Comm.Get_parent().Merge()
    process_rank = comm.Get_rank()
    process_count = comm.Get_size()
    process_host = MPI.Get_processor_name()
    print('worker', process_rank, process_count, process_host)


main()
I would like to know how to control where the spawned processes are placed. When I run the job as:
mpirun -hostfile slurm.hosts -np 1 python3 ./testmanager.py
with, for example, the following slurm.hosts:
node-105:16
node-114:16
node-127:16
I end up with the manager running as a single process on node-105 and all of the workers running on the other two nodes. If I increase the number of workers beyond the total number of slots on the non-manager nodes, the job hangs. I want to be able to use all available slots on all three nodes.
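To make "all available slots" concrete, the slot total can be read off the host file itself. The following is only a minimal sketch, assuming every non-blank line has the form `host:slots` as in the example above. The `count_slots` helper and its `default_slots` parameter are hypothetical names used for illustration and are not part of mpi4py or Intel MPI. It does not change where spawned processes are placed; it only computes how many workers would fill the nodes once the manager has taken one slot.

```python
def count_slots(hostfile_text, default_slots=1):
    """Parse 'host:slots' lines into a {host: slots} dict (hypothetical helper)."""
    slots = {}
    for raw_line in hostfile_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        host, _, count = line.partition(":")
        # Assumption: a host listed without ':N' contributes default_slots.
        slots[host] = slots.get(host, 0) + (int(count) if count else default_slots)
    return slots


# Contents of the example slurm.hosts, inlined so the sketch is self-contained.
hosts = count_slots("node-105:16\nnode-114:16\nnode-127:16\n")
total_slots = sum(hosts.values())
worker_count = total_slots - 1  # leave one slot for the manager process
print(total_slots, worker_count)  # prints: 48 47
```

With the three-node file above, this gives 48 slots in total, so 47 workers plus the manager would occupy every slot.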
Thanks!