Hi, I have a question about an efficient way to extend a well-implemented scientific code, written in C++ with OpenMP, with an MPI layer.
The code is an architecture-aware implementation (ccNUMA, affinity, caches, etc.) that exploits different aspects of the architecture and, in particular, uses all available threads.
The main goal is to add the MPI layer efficiently, without losing performance in the existing shared-memory code.
So I have to overlap MPI communication with OpenMP computation. My application allows this because it uses a loop-blocking technique.
In short: while the results of one block are being sent to another MPI rank, the OpenMP threads can carry on computing. This scheme is repeated several times, after which a synchronization point is necessary. The whole structure is then run thousands of times.
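To make the scheme concrete, here is a rough sketch of the overlap I have in mind. It is only an illustration, not the real code: `send_block` is a hypothetical stand-in for the MPI part (in practice something like MPI_Isend/MPI_Irecv plus a wait), `kernel` is a placeholder for the real computation, and the two-buffer layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the MPI exchange of one finished block.
using SendBlock = std::function<void(const std::vector<double>&)>;

// Placeholder for the real computation of element i in a given block.
inline double kernel(std::size_t i, int block) {
    return static_cast<double>(block) + 0.5 * static_cast<double>(i);
}

// Computes num_blocks blocks. While block b is being computed, the master
// thread first hands block b-1 to send_block, then joins the computation.
inline void run_blocks(int num_blocks, std::size_t block_size,
                       const SendBlock& send_block) {
    assert(num_blocks >= 0 && block_size > 0);
    std::vector<double> buf[2] = {std::vector<double>(block_size),
                                  std::vector<double>(block_size)};
    const long n = static_cast<long>(block_size);

    #pragma omp parallel
    {
        for (int b = 0; b <= num_blocks; ++b) {
            std::vector<double>& cur = buf[b % 2];
            const std::vector<double>& prev = buf[(b + 1) % 2];

            // `master` has no implied barrier, so the other threads go
            // straight on to the loop below while the master communicates.
            #pragma omp master
            {
                if (b > 0) send_block(prev);
            }

            if (b < num_blocks) {
                // Dynamic scheduling lets the master pick up remaining
                // chunks once its communication has finished. The implicit
                // barrier at the end of the loop is the synchronization
                // point before buffers are reused.
                #pragma omp for schedule(dynamic, 64)
                for (long i = 0; i < n; ++i) {
                    cur[static_cast<std::size_t>(i)] =
                        kernel(static_cast<std::size_t>(i), b);
                }
            }
        }
    }
}
```

Since only the master thread would call MPI in this sketch, the MPI_THREAD_FUNNELED support level should be enough. Without OpenMP enabled the pragmas are ignored and the same code runs serially (send, then compute).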
The main characteristic (and limitation) of the MPI communication is that many small pieces of data have to be exchanged: a lot of data bars of 1.5 KB or 3 KB each, taken from 3D arrays.
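One option I am considering for the small messages is to pack the bars of a whole face into one contiguous buffer and send that as a single message. The sketch below assumes a hypothetical dense layout with z as the fastest index, so that a bar is the run of nz values at a fixed (x, y); with doubles, a 1.5 KB bar would be nz = 192. `Grid3D`, `pack_face_y` and `unpack_face_y` are illustrative names, not part of the real code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense 3D array, z fastest: index = (x * ny + y) * nz + z.
struct Grid3D {
    std::size_t nx, ny, nz;
    std::vector<double> data;

    Grid3D(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_, 0.0) {}

    // Pointer to the first element of the bar at (x, y).
    double* bar(std::size_t x, std::size_t y) {
        return data.data() + (x * ny + y) * nz;
    }
    const double* bar(std::size_t x, std::size_t y) const {
        return data.data() + (x * ny + y) * nz;
    }
};

// Gathers the nx strided bars of the face y == y_face into one contiguous
// buffer, so the face can go out as one message instead of nx small ones.
inline std::vector<double> pack_face_y(const Grid3D& g, std::size_t y_face) {
    assert(y_face < g.ny);
    std::vector<double> buf(g.nx * g.nz);
    for (std::size_t x = 0; x < g.nx; ++x) {
        std::copy(g.bar(x, y_face), g.bar(x, y_face) + g.nz,
                  buf.begin() + static_cast<std::ptrdiff_t>(x * g.nz));
    }
    return buf;
}

// Scatters a received contiguous buffer back into the face y == y_face.
inline void unpack_face_y(Grid3D& g, std::size_t y_face,
                          const std::vector<double>& buf) {
    assert(y_face < g.ny && buf.size() == g.nx * g.nz);
    for (std::size_t x = 0; x < g.nx; ++x) {
        std::copy(buf.begin() + static_cast<std::ptrdiff_t>(x * g.nz),
                  buf.begin() + static_cast<std::ptrdiff_t>((x + 1) * g.nz),
                  g.bar(x, y_face));
    }
}
```

The same strided pattern could probably also be described to MPI directly with a derived datatype such as MPI_Type_vector (nx blocks of nz elements with stride ny * nz), which avoids the explicit copy; I do not know which of the two is faster for bars this small.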
The code will run on fairly new hardware and software:
1. Intel CPU cluster
2. Intel MIC cluster: MPI communication between KNCs (and KNL, similar to 1.)
3. Hybrid: MPI communication between CPUs and MICs
The general question is how to do this in an efficient way. I am not asking about implementation details, but about which MPI scenarios can guarantee the best performance.
In detail:
1. Does MPI communication cause any overhead on the cores? I mean the case where MPI communication and OMP computation run at the same time, but on different memory regions.
2. Should I dedicate a separate core to MPI communication while the other cores perform the OMP computations? Which scenario will be more efficient:
   a. the OMP master, or a single thread bound to a single physical core, runs communication only, while the other OMP threads use the other cores for computation (which communication would be better here, synchronous or asynchronous?)
   b. a selected group of OMP threads handles both MPI communication and computation, while the other OMP threads handle computation only
   c. or some other solution?
In fact, 2.b is the most suitable for my application, but then the programmer is responsible for guaranteeing the correct MPI communication paths between MPI ranks and OMP threads.
If anyone can help me or share their experience with this, I will be very grateful.
Lukasz