Channel: Clusters and HPC Technology

Combining multithreading and Intel MPI

Hi all,

### The problem at hand ###

I am trying to model a workflow where a manager process and multiple worker processes have multiple "conversations" simultaneously.

The manager process has a "main thread" that sends messages to each worker, waits for the workers to do something, and picks up the results (this entire pattern is repeated many times).
It also has a secondary thread that listens for messages from the workers and processes them independently of the main computation, sending a status back to each worker.
The secondary thread is very simple; it looks as follows (pseudo-code):

while (true) {
    status = probe(comm)
    if (status.tag == relatedToThisThread) {
        msg = receive(comm, status.tag, status.source)
        reply = process(msg)
        send(comm, reply, status.source)
    }
}
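If `probe(comm)` in the loop above accepts any tag, it can keep reporting a message that actually belongs to the main thread; the secondary thread then re-probes that same message on every pass until the main thread receives it. The toy model below illustrates this in plain, single-threaded C. It is not MPI: it uses one FIFO queue instead of MPI's per-sender ordering, and all names (`Mailbox`, `mb_post`, `mb_probe`, `mb_take`, `secondary_pass`) and tag values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy queue of pending messages on one communicator.
   Hypothetical names; a simplification, not the MPI API. */
#define ANY_TAG (-1)
#define CAP 16

typedef struct { int source; int tag; int payload; } Msg;
typedef struct { Msg q[CAP]; size_t n; } Mailbox;

static void mb_post(Mailbox *mb, int source, int tag, int payload) {
    assert(mb->n < CAP);                      /* toy capacity limit */
    mb->q[mb->n++] = (Msg){source, tag, payload};
}

/* Non-destructive probe: index of the oldest message whose tag matches,
   or -1 if there is none. The message stays in the queue. */
static int mb_probe(const Mailbox *mb, int tag) {
    for (size_t i = 0; i < mb->n; i++)
        if (tag == ANY_TAG || mb->q[i].tag == tag) return (int)i;
    return -1;
}

/* Remove and return the message at index i. */
static Msg mb_take(Mailbox *mb, int i) {
    Msg m = mb->q[i];
    for (size_t j = (size_t)i + 1; j < mb->n; j++) mb->q[j - 1] = mb->q[j];
    mb->n--;
    return m;
}

/* One pass of the secondary thread's loop: probe with probe_tag, and take
   the message only if it carries my_tag. Returns true if a message was taken. */
static bool secondary_pass(Mailbox *mb, int probe_tag, int my_tag, Msg *out) {
    int i = mb_probe(mb, probe_tag);
    if (i < 0 || mb->q[i].tag != my_tag)
        return false;                         /* not ours: the loop probes again */
    *out = mb_take(mb, i);
    return true;
}
```

With a main-thread message (tag 1) queued ahead of a secondary-thread request (tag 2), `secondary_pass(&mb, ANY_TAG, 2, &m)` keeps returning false, whereas `secondary_pass(&mb, 2, 2, &m)` skips the main thread's message and picks up the request.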

From the workers' point of view, they may send requests to the secondary thread at any time, either in the middle of a piece of work or between two pieces of work.

So far I have tried two different approaches to model this workflow:

### Case 1: 1 Manager (1 communicator, 2 threads); N Workers (1 communicator, 1 thread each) ###

The Manager has a single communicator through which it communicates with all Workers.
Both threads share a reference to this communicator, which connects the Manager to the N Worker processes.

I sometimes get failures: even though I call "receive" with a specific tag and source, messages addressed to different threads occasionally seem to get intermingled. The failure can be made reproducible by using a large number of worker processes.
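One well-known hazard of a separate probe followed by a receive in multithreaded code is that the probed message is not reserved. Between the probe and the receive, another thread whose (source, tag) pattern also matches that message can receive it, and the later receive then picks up a different message, possibly of a different size. Whether this is what happens in Case 1 depends on the patterns each thread actually uses. The single-threaded toy model below replays that interleaving step by step; it is plain C rather than MPI, and `Mailbox`, `Envelope`, `mb_probe` and `mb_recv` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy queue of unreceived messages on one communicator.
   Hypothetical names; a simplification, not the MPI API. */
#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)
#define CAP 16

typedef struct { int source; int tag; int len; } Envelope;  /* what a probe reports */
typedef struct { Envelope env; int id; } Msg;
typedef struct { Msg q[CAP]; size_t n; } Mailbox;

static void mb_post(Mailbox *mb, int source, int tag, int len, int id) {
    assert(mb->n < CAP);                       /* toy capacity limit */
    mb->q[mb->n++] = (Msg){{source, tag, len}, id};
}

/* MPI-style matching: source and tag must be equal or wildcards. */
static bool matches(const Envelope *e, int source, int tag) {
    return (source == ANY_SOURCE || e->source == source)
        && (tag == ANY_TAG || e->tag == tag);
}

/* Probe: report the envelope of the oldest matching message. The message
   stays queued, so any thread's receive can still take it. */
static bool mb_probe(const Mailbox *mb, int source, int tag, Envelope *out) {
    for (size_t i = 0; i < mb->n; i++) {
        if (matches(&mb->q[i].env, source, tag)) {
            *out = mb->q[i].env;
            return true;
        }
    }
    return false;
}

/* Receive: remove and return the oldest matching message. */
static bool mb_recv(Mailbox *mb, int source, int tag, Msg *out) {
    for (size_t i = 0; i < mb->n; i++) {
        if (matches(&mb->q[i].env, source, tag)) {
            *out = mb->q[i];
            for (size_t j = i + 1; j < mb->n; j++) mb->q[j - 1] = mb->q[j];
            mb->n--;
            return true;
        }
    }
    return false;
}
```

In the replay, thread A probes a 4-byte message from worker 2, thread B's overlapping receive takes that message first, and thread A's receive with the probed source and tag then returns the 64-byte message instead. With real MPI, a mismatch like this could surface as a truncation error (when the buffer is too small) or as a message handled by the wrong code path.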

### Case 2: 1 Manager (N communicators, N + 1 threads); N Workers (1 communicator, 1 thread each) ###

The Manager has one communicator per worker (mostly to improve fault tolerance). Each communicator is referenced by both the main thread and one of the N secondary threads.

This seems more reliable; in particular, it scales better with a large number of workers. However, I cannot convince myself that the problem I observed in Case 1 cannot also happen here.
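One way to reason about Case 2 is through MPI's matching rule: a receive or probe can match a message only if the communicator is the same and the source and tag are each either equal or a wildcard (there is no wildcard for the communicator). Two threads can therefore compete for a message only if their patterns overlap in that sense. The sketch below encodes that check. `Pattern`, `patterns_overlap` and the integer communicator IDs are hypothetical, and `ANY_SOURCE`/`ANY_TAG` merely stand in for `MPI_ANY_SOURCE`/`MPI_ANY_TAG`.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG (hypothetical values). */
#define ANY_SOURCE (-1)
#define ANY_TAG    (-2)

/* What a receive or probe accepts. Communicators are modelled as
   plain integer IDs (hypothetical). */
typedef struct { int comm; int source; int tag; } Pattern;

/* Two values of one field can match the same message if they are
   equal or either side is the wildcard. */
static bool field_overlaps(int a, int b, int wildcard) {
    return a == wildcard || b == wildcard || a == b;
}

/* True if some message could be matched by both patterns, i.e. two
   threads posting them could compete for the same message. */
static bool patterns_overlap(Pattern a, Pattern b) {
    return a.comm == b.comm                   /* no communicator wildcard */
        && field_overlaps(a.source, b.source, ANY_SOURCE)
        && field_overlaps(a.tag, b.tag, ANY_TAG);
}
```

For example, a main-thread pattern `{1, 3, 10}` and a secondary-thread pattern `{1, ANY_SOURCE, 20}` cannot collide, but replacing the main thread's tag with `ANY_TAG` makes them overlap, while any pattern on communicator 2 never overlaps one on communicator 1. With one communicator per worker, the main thread and that worker's secondary thread still share a communicator, so they are kept apart only by their source and tag patterns, as in Case 1. The argument also presumes the library was initialized with `MPI_THREAD_MULTIPLE`, since lower thread levels do not allow concurrent MPI calls from several threads.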

### My question ###

I was wondering if anyone had had to deal with a similar situation. What strategy did you use to solve these issues?

From my initial reading, it seems that MPI_Mprobe and MPI_Mrecv may be the key to solving this problem, but I would welcome any suggestions while I experiment with this situation.
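`MPI_Mprobe` targets exactly the hazard of a separate probe and receive: it matches a message and removes it from further matching in one step, returning an `MPI_Message` handle that only `MPI_Mrecv` (or `MPI_Imrecv`) can consume. The toy model below mimics those semantics in plain, single-threaded C. `MsgHandle`, `mb_mprobe`, `mb_mrecv` and the other names are hypothetical and are not the MPI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a matched probe (hypothetical names; not the MPI API). */
#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)
#define CAP 16

typedef struct { int source; int tag; int payload; } Msg;
typedef struct { Msg q[CAP]; size_t n; } Mailbox;

/* Stands in for an MPI_Message handle: the message it holds has already
   left the mailbox, so no other probe or receive can match it. */
typedef struct { Msg msg; bool valid; } MsgHandle;

static void mb_post(Mailbox *mb, int source, int tag, int payload) {
    assert(mb->n < CAP);                       /* toy capacity limit */
    mb->q[mb->n++] = (Msg){source, tag, payload};
}

static bool matches(const Msg *m, int source, int tag) {
    return (source == ANY_SOURCE || m->source == source)
        && (tag == ANY_TAG || m->tag == tag);
}

/* Ordinary receive: remove and return the oldest matching message. */
static bool mb_recv(Mailbox *mb, int source, int tag, Msg *out) {
    for (size_t i = 0; i < mb->n; i++) {
        if (matches(&mb->q[i], source, tag)) {
            *out = mb->q[i];
            for (size_t j = i + 1; j < mb->n; j++) mb->q[j - 1] = mb->q[j];
            mb->n--;
            return true;
        }
    }
    return false;
}

/* Matched probe: find the message and claim it in the same step. The caller
   can inspect h->msg's source and tag (e.g. to size a buffer) before receiving. */
static bool mb_mprobe(Mailbox *mb, int source, int tag, MsgHandle *h) {
    h->valid = mb_recv(mb, source, tag, &h->msg);
    return h->valid;
}

/* Matched receive: hand over the claimed message; a handle is used once. */
static Msg mb_mrecv(MsgHandle *h) {
    assert(h->valid);
    h->valid = false;
    return h->msg;
}
```

In a replay where the secondary thread calls `mb_mprobe` first, an overlapping receive from another thread can only obtain the second queued message, because the first one has already been claimed. In real code the corresponding sequence would be `MPI_Mprobe(source, tag, comm, &message, &status)`, then `MPI_Get_count(&status, datatype, &count)` to size the buffer, then `MPI_Mrecv(buf, count, datatype, &message, &status)`.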
