Channel: Clusters and HPC Technology

Debugging a 'Too many communicators' error


I have a large code base that fails with the following error:

Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4027cf0, color=0, key=0, new_comm=0x7ffdb50f2bd0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc401bcf1, color=1, key=0, new_comm=0x7ffed5aa4fd0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4027ce9, color=0, key=0, new_comm=0x7ffe37e477d0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc401bcf1, color=1, key=0, new_comm=0x7ffd511ac4d0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (0/16384 free on this process; ignore_id=0)

I would like to debug it. I can reproduce the error in TotalView.

My first idea is to look at the stack trace at the point of the error. If I set a breakpoint on the call to "Get_contextid_sparse_group" or "Comm_split_impl", the error occurs before the breakpoint is hit and TotalView just closes.

If I set it on "Comm_split", the breakpoint is hit so many times that I can't find the call that fails. How can I set a breakpoint in Intel MPI's error-handling routine? Some routine must print this "Too many communicators" error message. Can I set my breakpoint there?

My second idea is to monitor the number of communicators somehow. The line

Too many communicators (0/16384 free on this process; ignore_id=0)

indicates that MPI knows how many communicators are free at any given time. How can I, as a developer, monitor this number? Is there a function I can call that returns the number of communicators currently in use?

I am open to other ideas on how to track down this "communicator leak".
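
One possible way to do both things at once (count live communicators and get a usable breakpoint) might be to intercept MPI_Comm_split and MPI_Comm_free through the standard PMPI profiling interface. The following is only a minimal sketch, not anything Intel MPI provides: the names track_create, track_free, track_live, comm_leak_hook, warn_threshold and the USE_MPI switch are all made up for illustration. It assumes the calling code is C or C++, that the leak comes from MPI_Comm_split as in the error stack above (MPI_Comm_dup, MPI_Comm_create and friends would need the same treatment), and that MPI_Comm_c2f gives a usable integer key for a communicator.

```c
#include <stdio.h>
#ifdef USE_MPI              /* hypothetical switch: define it when building against MPI */
#include <mpi.h>
#endif

#define MAX_TRACKED 32768   /* assumption: above the 16384 limit shown in the log */

static int  live_key[MAX_TRACKED];    /* integer handle of each live communicator */
static int  live_color[MAX_TRACKED];  /* color it was split with */
static int  live_count = 0;
static long total_created = 0;
static int  warn_threshold = 1000;    /* hypothetical value, tune for your code */

/* Set the debugger breakpoint here. It is reached once, when the number
   of live communicators first reaches warn_threshold, i.e. well before
   the fatal error, so the stack trace is still available. */
void comm_leak_hook(int count)
{
    fprintf(stderr, "comm tracker: %d live communicators (%ld created)\n",
            count, total_created);
}

/* Record a newly created communicator. */
void track_create(int key, int color)
{
    total_created++;
    if (live_count < MAX_TRACKED) {
        live_key[live_count] = key;
        live_color[live_count] = color;
        live_count++;
    }
    if (live_count == warn_threshold)
        comm_leak_hook(live_count);
}

/* Forget a communicator that is being freed; unknown keys are ignored. */
void track_free(int key)
{
    for (int i = 0; i < live_count; i++) {
        if (live_key[i] == key) {
            live_count--;
            live_key[i] = live_key[live_count];     /* fill the hole with the last slot */
            live_color[i] = live_color[live_count];
            return;
        }
    }
}

/* Number of communicators created through the wrapper and not yet freed. */
int track_live(void)
{
    return live_count;
}

#ifdef USE_MPI
/* PMPI interposition: the application's calls land here and are
   forwarded to the real implementation via the PMPI_ entry points. */
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
{
    int rc = PMPI_Comm_split(comm, color, key, newcomm);
    if (rc == MPI_SUCCESS && *newcomm != MPI_COMM_NULL)
        track_create((int)MPI_Comm_c2f(*newcomm), color);
    return rc;
}

int MPI_Comm_free(MPI_Comm *comm)
{
    track_free((int)MPI_Comm_c2f(*comm));   /* read the handle before it is nulled */
    return PMPI_Comm_free(comm);
}
#endif
```

Under these assumptions, the file would be compiled with USE_MPI defined and linked into the application so that its MPI_Comm_split and MPI_Comm_free are found before the ones in the MPI library. A breakpoint on comm_leak_hook should then stop the run once, at the call that pushes the live count to the threshold, and track_live() can be called from the code or evaluated in the debugger at any time. The lookup is a linear scan, which is slow but probably acceptable for a debugging build.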

