Hi all,
### The problem at hand ###
I am trying to model a workflow where a manager process and multiple worker processes have multiple "conversations" simultaneously.
The manager process has a "main thread" that sends messages to each worker, waits for them to do something, and picks up the results (this entire pattern being repeated many times).
It also has a secondary thread that listens for messages from the workers and processes them independently of the main computations (sending a status back to the workers).
The secondary thread is very simple: it looks as follows (pseudo-code):
    while (true) {
        status = probe(comm)
        if (status.tag == relatedToThisThread) {
            msg = receive(comm, status.tag, status.source)
            reply = process(msg)
            send(comm, status.source, reply)
        }
    }
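For anyone who wants to experiment with this loop without an MPI installation, here is a minimal sketch in Python. It is only a toy: `ToyComm`, `listener_step`, `STATUS_TAG` and `WORK_TAG` are hypothetical names made up for illustration, not part of any MPI binding, and the probe is assumed to be a wildcard probe that always reports the oldest pending message.

```python
from collections import deque

STATUS_TAG = 7  # hypothetical tag reserved for the secondary thread
WORK_TAG = 1    # hypothetical tag used by the main thread


class ToyComm:
    """In-process stand-in for the manager's communicator (not a real MPI API)."""

    def __init__(self):
        self.inbox = deque()  # pending (source, tag, payload) messages
        self.sent = []        # (dest, payload) replies sent by the manager

    def deliver(self, source, tag, payload):
        """Simulate a worker's message arriving at the manager."""
        self.inbox.append((source, tag, payload))

    def probe(self):
        """Wildcard probe: envelope of the oldest pending message, or None."""
        if not self.inbox:
            return None
        source, tag, _ = self.inbox[0]
        return source, tag

    def receive(self, source, tag):
        """Remove and return the oldest message matching (source, tag)."""
        for i, (s, t, payload) in enumerate(self.inbox):
            if s == source and t == tag:
                del self.inbox[i]
                return payload
        raise LookupError("no matching message")

    def send(self, dest, payload):
        self.sent.append((dest, payload))


def listener_step(comm, process):
    """One iteration of the secondary thread's loop.

    Returns True if a message was handled, False otherwise.
    """
    envelope = comm.probe()
    if envelope is None:
        return False
    source, tag = envelope
    if tag != STATUS_TAG:
        return False  # belongs to the main thread; leave it queued
    msg = comm.receive(source, tag)
    comm.send(source, process(msg))
    return True


comm = ToyComm()
comm.deliver(source=2, tag=STATUS_TAG, payload="ping")
comm.deliver(source=1, tag=WORK_TAG, payload="result")
comm.deliver(source=4, tag=STATUS_TAG, payload="ping")

# Three listener iterations while the main thread has not yet received anything.
handled = [listener_step(comm, str.upper) for _ in range(3)]

# The main thread picks up its work result, then the listener runs once more.
work_result = comm.receive(source=1, tag=WORK_TAG)
handled.append(listener_step(comm, str.upper))
```

In this toy model the message from worker 4 is only handled once the main thread has consumed the work result sitting at the head of the queue. Whether a real wildcard probe behaves this way depends on the MPI implementation, but probing on the secondary thread's tag only (instead of filtering after a wildcard probe) would sidestep the question.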
From the workers' point of view, they may send a message to the secondary thread at any time, either in the middle of their work or between two pieces of work.
I tried 2 different approaches so far to model this workflow:
### Case 1: 1 Manager (1 communicator, 2 threads); N Workers (1 communicator, 1 thread each) ###
The Manager has one communicator with which it can communicate with all Workers.
Both threads share a reference to the same communicator that is hooked to the N Worker processes.
I sometimes get failures: even though I call "receive" with a given tag and source, messages addressed to different threads occasionally seem to get intermingled (this becomes reproducible with a large number of worker processes).
### Case 2: 1 Manager (N communicators, N + 1 threads); N Workers (1 communicator, 1 thread each) ###
The Manager has one communicator per worker (mostly to improve fault tolerance). Each communicator is referenced by both the main thread and one of the N secondary threads.
This seems more reliable. In particular, it scales better with a large number of workers. However, I cannot convince myself that the same problem I observed in case 1 cannot happen here.
### My question ###
I was wondering if anyone had had to deal with a similar situation. What strategy did you use to solve these issues?
From my initial reading, it seems that MPI_Mprobe and MPI_Mrecv may be the key to solving this problem, but I would welcome any suggestions while I'm experimenting with this.
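To make concrete why a matched probe might help, here is a small Python sketch of the difference as I understand it. It is a toy model, not MPI: `ToyMailbox`, `mprobe` and `mreceive` are hypothetical stand-ins that only mimic the idea behind MPI_Mprobe and MPI_Mrecv, and the two "threads" A and B are interleaved by hand so that the outcome is deterministic. It assumes two threads probe for the same tag, which may or may not be what happens in Case 1.

```python
from collections import deque


class ToyMailbox:
    """In-process stand-in for the pending messages of one communicator."""

    def __init__(self):
        self._pending = deque()  # entries are (source, tag, payload)

    def deliver(self, source, tag, payload):
        self._pending.append((source, tag, payload))

    def probe(self, tag):
        """Plain probe: report (source, tag, size) and leave the message queued."""
        for source, t, payload in self._pending:
            if t == tag:
                return source, t, len(payload)
        return None

    def receive(self, source, tag):
        """Receive the oldest queued message matching (source, tag)."""
        for i, (s, t, payload) in enumerate(self._pending):
            if s == source and t == tag:
                del self._pending[i]
                return payload
        return None

    def mprobe(self, tag):
        """Matched probe: claim the message so no other probe or receive sees it."""
        for i, (source, t, payload) in enumerate(self._pending):
            if t == tag:
                del self._pending[i]
                return (source, t, payload)  # opaque handle in this toy
        return None


def mreceive(handle):
    """Matched receive: return exactly the message the handle refers to."""
    return handle[2]


def fill(box):
    box.deliver(source=3, tag=7, payload="short")
    box.deliver(source=3, tag=7, payload="a much longer message")


def probe_then_receive():
    """A and B both probe before either one receives."""
    box = ToyMailbox()
    fill(box)
    src_a, tag_a, size_a = box.probe(tag=7)
    src_b, tag_b, size_b = box.probe(tag=7)  # B sees the same envelope as A
    got_a = box.receive(src_a, tag_a)
    got_b = box.receive(src_b, tag_b)        # B gets the next message instead
    return (size_a, len(got_a)), (size_b, len(got_b))


def matched_probe_then_receive():
    """Same interleaving, but each probe claims its message."""
    box = ToyMailbox()
    fill(box)
    handle_a = box.mprobe(tag=7)
    handle_b = box.mprobe(tag=7)             # first message is already claimed
    got_a, got_b = mreceive(handle_a), mreceive(handle_b)
    return (len(handle_a[2]), len(got_a)), (len(handle_b[2]), len(got_b))


plain = probe_then_receive()
matched = matched_probe_then_receive()
```

Each pair is (size announced by the probe, size actually received). With the plain probe, B's two numbers disagree because another thread consumed the message it probed; with the matched probe they agree for both threads. If something like this is behind the failures in Case 1, matched probes would be worth trying there first.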