Need help making sense of NBC performance (MPI3)


Hello everyone,

I am fairly new to parallel computing, but I am working on a certain legacy code that uses real-space domain decomposition for electronic structure calculations. I have spent a while modernizing the main computational kernel to hybrid MPI+OpenMP and upgraded the communication pattern to use a nonblocking neighborhood alltoallv for the halo exchange and a nonblocking allreduce for the other communication in the kernel. I have now started to focus on "communication hiding", so that computation and communication overlap.
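For concreteness, here is a minimal sketch (in C) of the kind of pattern I mean. The 1-D neighbor list, the counts/displacements and the buffer sizes are placeholders for illustration only; the actual kernel is more involved:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Hypothetical 1-D decomposition: left and right neighbors
       (assumes at least 3 ranks so the neighbors are distinct). */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };

    /* Neighborhood (graph) communicator needed by MPI_Ineighbor_alltoallv. */
    MPI_Comm nbr_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, MPI_UNWEIGHTED,  /* sources      */
                                   2, nbrs, MPI_UNWEIGHTED,  /* destinations */
                                   MPI_INFO_NULL, 0, &nbr_comm);

    /* Placeholder halo layout: 100 doubles to/from each neighbor. */
    int scounts[2] = {100, 100}, sdispls[2] = {0, 100};
    int rcounts[2] = {100, 100}, rdispls[2] = {0, 100};
    double *sendbuf = calloc(200, sizeof(double));
    double *recvbuf = calloc(200, sizeof(double));

    MPI_Request halo_req, red_req;
    double local_sum = 1.0, global_sum = 0.0;

    /* Post the nonblocking halo exchange and the nonblocking reduction. */
    MPI_Ineighbor_alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                            recvbuf, rcounts, rdispls, MPI_DOUBLE,
                            nbr_comm, &halo_req);
    MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &red_req);

    /* ... interior computation goes here, overlapping the communication ... */

    MPI_Wait(&halo_req, MPI_STATUS_IGNORE);
    MPI_Wait(&red_req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Comm_free(&nbr_comm);
    MPI_Finalize();
    return 0;
}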

EDIT: It is probably worth noting that the timescales and message sizes here are not representative of real applications; the issue is less critical for large message sizes and long computation times (as far as communication hiding is concerned). I still want to understand the behavior better...

For this purpose I tried progressing the communication manually using MPI_Test and MPI_Testall, but I had more success with the "spare core" technique: running each processing element (PE) with one thread less while setting MPICH_ASYNC_PROGRESS=1. However, I am probably too "green" to understand the outcome.
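To make the manual-progression variant concrete, the idea is simply to poke the library from inside the compute loop, roughly like the following sketch (do_chunk_of_work is stood in by a dummy accumulation; nchunks and the request array are placeholders):

#include <mpi.h>

void compute_with_manual_progress(int nchunks, int nreq, MPI_Request reqs[])
{
    int all_done = 0;
    double acc = 0.0;                 /* stand-in for the real computation */

    for (int c = 0; c < nchunks; ++c) {
        acc += c;                     /* ... one slice of the real work ... */

        /* Poke MPI every so often so the nonblocking operations make
           progress while we compute.  With the "spare core" approach
           (MPICH_ASYNC_PROGRESS=1 and one thread less per PE) the progress
           thread does this instead and the polling is unnecessary. */
        if (!all_done)
            MPI_Testall(nreq, reqs, &all_done, MPI_STATUSES_IGNORE);
    }
    (void)acc;
}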

For a description of my results, I refer you to the following output from the Intel Trace Collector. I am using the ITAC API to do low-overhead instrumentation of the different phases of the kernel. Important details: I am using Intel MPI 5.0.2, each node has 2x6 cores, and each PE is pinned to its own socket.
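The instrumentation itself is nothing fancy; it is roughly the usual VT API pattern from VT.h, something along these lines (the class and phase names below are illustrative, not my real ones):

#include <mpi.h>
#include <VT.h>    /* Intel Trace Collector API */

static int state_buffering, state_spmv;

/* Define the user-level states once, e.g. right after MPI_Init. */
void define_kernel_phases(void)
{
    int class_kernel;
    VT_classdef("Kernel", &class_kernel);
    VT_funcdef("Buffering", class_kernel, &state_buffering);
    VT_funcdef("SpMV",      class_kernel, &state_spmv);
}

/* Mark the phases inside the kernel so they show up as colored blocks
   in the timeline. */
void instrumented_kernel_step(void)
{
    VT_begin(state_buffering);
    /* ... copy halo data into the alltoallv send buffers ... */
    VT_end(state_buffering);

    VT_begin(state_spmv);
    /* ... sparse matrix-vector product ... */
    VT_end(state_spmv);
}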
The first screenshot shows a toy problem running on 4 nodes, 2 PEs per node, with 5 threads each, without async progress (more details after the picture):

[screenshot 1: ITAC timeline of the kernel phases]

The multitude of colors may be confusing at first, but it can be read off readily:
- The green rectangles ("Filter") are outside the kernel.
- Each kernel call begins with a pink buffering block (the alltoallv send buffers are being filled).
- Then, in bright red, a number of halos are sent (8 different ones in this case) using MPI_Ineighbor_alltoallv.
- Next, in a purple block, some mathematical operation occurs and one MPI_Iallreduce is called.
- Then comes a darker purple block, after which I call MPI_Test to manually progress the nonblocking Iallreduce.
- Then, in blue, I do a sparse matrix-vector product, after which I need to process the halos, so I call MPI_Waitall for EACH ONE (using Waitall lets me switch back and forth between MPI-2 irecv/isend and MPI-3 with a simple ifdef).
- Then you can see the completion of the collective, represented by the segmented thick blue lines across the nodes.
- Finally, the last purple block requires the result of the Iallreduce, so I do an MPI_Wait on it (the fairly vertical blue line; the wait time is shown in grey).
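Put differently, one kernel call follows roughly this skeleton (all names, buffers and the per-halo argument arrays are placeholders; the real code is structured differently, but the ordering of the MPI calls is the same):

#include <mpi.h>

#define MAX_HALOS 8   /* 8 halos in this toy example */

void kernel_call(MPI_Comm nbr_comm, MPI_Comm world, int nhalos,
                 double *sbuf[], int *scnt[], int *sdsp[],
                 double *rbuf[], int *rcnt[], int *rdsp[])
{
    MPI_Request halo_req[MAX_HALOS], red_req;
    double local = 0.0, global = 0.0;
    int flag;

    /* (pink)   copy halo data into the alltoallv send buffers */
    /* pack_send_buffers(sbuf); */

    /* (red)    post one nonblocking neighborhood alltoallv per halo */
    for (int h = 0; h < nhalos; ++h)
        MPI_Ineighbor_alltoallv(sbuf[h], scnt[h], sdsp[h], MPI_DOUBLE,
                                rbuf[h], rcnt[h], rdsp[h], MPI_DOUBLE,
                                nbr_comm, &halo_req[h]);

    /* (purple) local math producing a partial result, then Iallreduce */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, world, &red_req);

    /* (dark purple) more local work, then a manual progress poke */
    MPI_Test(&red_req, &flag, MPI_STATUS_IGNORE);
    (void)flag;   /* the Test is only there to push progress */

    /* (blue)   sparse matrix-vector product on the interior */
    /* spmv_interior(); */

    /* complete the halos one by one; a Waitall on each halo's request set
       keeps the MPI-2 isend/irecv and MPI-3 code paths interchangeable */
    for (int h = 0; h < nhalos; ++h) {
        MPI_Waitall(1, &halo_req[h], MPI_STATUSES_IGNORE);
        /* apply_halo(h, rbuf[h]); */
    }

    /* (last purple) needs the reduction result */
    MPI_Wait(&red_req, MPI_STATUS_IGNORE);
    /* use_global(global); */
}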

Notice that all the halos arrive at almost the same time; in fact, I get the same MPI_Waitall time even if I exchange fewer than 8 halos! This is because the first Waitall progresses all the messages (or so I would like to think...).
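One way I could cross-check this would be to complete all halos with a single MPI_Waitall over the whole request array instead of one Waitall per halo, so that the entire completion cost shows up in one place (halo_req and nhalos as in the placeholder skeleton above):

#include <mpi.h>

/* Complete all outstanding halo exchanges at once; with per-halo Waitall
   calls the first call appears to absorb nearly all of the progress work. */
void complete_all_halos(int nhalos, MPI_Request halo_req[])
{
    MPI_Waitall(nhalos, halo_req, MPI_STATUSES_IGNORE);
}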

Now I switch async progress on. There is an extra line for each PE (T1) for the MPI progress thread, which I colored white so it does not interfere with the important data:

Well, here is the thing I don't understand. On the one hand, the effect of the Iallreduce is really eye-catching here because the synchronization between PEs is less strict, and the MPI_Waitall times (between the halo parts) have decreased considerably. On the other hand, the calls to MPI_Ineighbor_alltoallv are now really time consuming, and I have lost everything I had gained (in fact, the code is slightly slower than in the first example). Because of the inconsistent call durations across PEs, I thought I must be saturating the hardware with all the async progression that has to take place, so I decreased the number of PEs to one per node (a 50% slowdown) while still using 5 threads per PE:

This time it takes a while for a steady state to appear, but the most interesting thing is that I now see increases in the Iallreduce time, and also in MPI_Wait.

Can anyone help me or at least tell me if I am doing something wrong? Did I stumble upon a library issue?

I will happily provide more data should you request it,
Dr Ariel Biller

