Need help making sense of NBC performance (MPI3)


Hello everyone,

I am fairly new to parallel computing, but I am working on a certain legacy code that uses real-space domain decomposition for electronic structure calculations. I have spent a while modernizing the main computational kernel to hybrid MPI+OpenMP and upgraded the communication pattern to use a nonblocking neighborhood alltoallv for the halo exchange and a nonblocking allreduce for the other communication in the kernel. I have now started to focus on "communication hiding", so that computation and communication overlap.
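For concreteness, here is a minimal sketch (in C) of the kind of pattern I mean. The 1-D neighbor list, the counts/displacements and the buffer sizes are placeholders for illustration only; the actual kernel is more involved:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Hypothetical 1-D decomposition: left and right neighbors
       (assumes at least 3 ranks so the neighbors are distinct). */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };

    /* Neighborhood (graph) communicator needed by MPI_Ineighbor_alltoallv. */
    MPI_Comm nbr_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, MPI_UNWEIGHTED,  /* sources      */
                                   2, nbrs, MPI_UNWEIGHTED,  /* destinations */
                                   MPI_INFO_NULL, 0, &nbr_comm);

    /* Placeholder halo layout: 100 doubles to/from each neighbor. */
    int scounts[2] = {100, 100}, sdispls[2] = {0, 100};
    int rcounts[2] = {100, 100}, rdispls[2] = {0, 100};
    double *sendbuf = calloc(200, sizeof(double));
    double *recvbuf = calloc(200, sizeof(double));

    MPI_Request halo_req, red_req;
    double local_sum = 1.0, global_sum = 0.0;

    /* Post the nonblocking halo exchange and the nonblocking reduction. */
    MPI_Ineighbor_alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                            recvbuf, rcounts, rdispls, MPI_DOUBLE,
                            nbr_comm, &halo_req);
    MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &red_req);

    /* ... interior computation goes here, overlapping the communication ... */

    MPI_Wait(&halo_req, MPI_STATUS_IGNORE);
    MPI_Wait(&red_req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Comm_free(&nbr_comm);
    MPI_Finalize();
    return 0;
}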

EDIT: It is probably worth noting that the timescales and message sizes here are not representative of real applications; the issue is less critical for large message sizes and long computation times (as far as communication hiding is concerned). I still want to understand the behavior better...

For this purpose I tried progressing the communication manually using MPI_Test and MPI_Testall, but I had more success with the "spare core" technique: running each processing element (PE) with one thread less while setting MPICH_ASYNC_PROGRESS=1. However, I am probably too "green" to understand the outcome.
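To make the manual-progression variant concrete, the idea is simply to poke the library from inside the compute loop, roughly like the following sketch (do_chunk_of_work is stood in by a dummy accumulation; nchunks and the request array are placeholders):

#include <mpi.h>

void compute_with_manual_progress(int nchunks, int nreq, MPI_Request reqs[])
{
    int all_done = 0;
    double acc = 0.0;                 /* stand-in for the real computation */

    for (int c = 0; c < nchunks; ++c) {
        acc += c;                     /* ... one slice of the real work ... */

        /* Poke MPI every so often so the nonblocking operations make
           progress while we compute.  With the "spare core" approach
           (MPICH_ASYNC_PROGRESS=1 and one thread less per PE) the progress
           thread does this instead and the polling is unnecessary. */
        if (!all_done)
            MPI_Testall(nreq, reqs, &all_done, MPI_STATUSES_IGNORE);
    }
    (void)acc;
}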

For a description of my results, I refer you to the following output from the Intel Trace Collector. I am using the ITAC API to do low-overhead instrumentation of the different phases of the kernel. Important details: I am using Intel MPI 5.0.2, each node has 2x6 cores, and each PE is pinned to its own socket.
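The instrumentation itself is nothing fancy; it is roughly the usual VT API pattern from VT.h, something along these lines (the class and phase names below are illustrative, not my real ones):

#include <mpi.h>
#include <VT.h>    /* Intel Trace Collector API */

static int state_buffering, state_spmv;

/* Define the user-level states once, e.g. right after MPI_Init. */
void define_kernel_phases(void)
{
    int class_kernel;
    VT_classdef("Kernel", &class_kernel);
    VT_funcdef("Buffering", class_kernel, &state_buffering);
    VT_funcdef("SpMV",      class_kernel, &state_spmv);
}

/* Mark the phases inside the kernel so they show up as colored blocks
   in the timeline. */
void instrumented_kernel_step(void)
{
    VT_begin(state_buffering);
    /* ... copy halo data into the alltoallv send buffers ... */
    VT_end(state_buffering);

    VT_begin(state_spmv);
    /* ... sparse matrix-vector product ... */
    VT_end(state_spmv);
}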
The first screenshot shows a toy problem running on 4 nodes, 2 PEs per node, with 5 threads each, without async progress (more details after the picture):

[screenshot 1: ITAC timeline of the kernel phases]

The multitude of colors may be confusing at first, but it can be read off readily:
- The green rectangles ("Filter") are outside the kernel.
- Each kernel call begins with a pink buffering block (the alltoallv send buffers are being filled).
- Then, in bright red, a number of halos are sent (8 different ones in this case) using MPI_Ineighbor_alltoallv.
- Next, in a purple block, some mathematical operation occurs and one MPI_Iallreduce is called.
- Then comes a darker purple block, after which I call MPI_Test to manually progress the nonblocking Iallreduce.
- Then, in blue, I do a sparse matrix-vector product, after which I need to process the halos, so I call MPI_Waitall for EACH ONE (using Waitall lets me switch back and forth between MPI-2 irecv/isend and MPI-3 with a simple ifdef).
- Then you can see the completion of the collective, represented by the segmented thick blue lines across the nodes.
- Finally, the last purple block requires the result of the Iallreduce, so I do an MPI_Wait on it (the fairly vertical blue line; the wait time is shown in grey).
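Put differently, one kernel call follows roughly this skeleton (all names, buffers and the per-halo argument arrays are placeholders; the real code is structured differently, but the ordering of the MPI calls is the same):

#include <mpi.h>

#define MAX_HALOS 8   /* 8 halos in this toy example */

void kernel_call(MPI_Comm nbr_comm, MPI_Comm world, int nhalos,
                 double *sbuf[], int *scnt[], int *sdsp[],
                 double *rbuf[], int *rcnt[], int *rdsp[])
{
    MPI_Request halo_req[MAX_HALOS], red_req;
    double local = 0.0, global = 0.0;
    int flag;

    /* (pink)   copy halo data into the alltoallv send buffers */
    /* pack_send_buffers(sbuf); */

    /* (red)    post one nonblocking neighborhood alltoallv per halo */
    for (int h = 0; h < nhalos; ++h)
        MPI_Ineighbor_alltoallv(sbuf[h], scnt[h], sdsp[h], MPI_DOUBLE,
                                rbuf[h], rcnt[h], rdsp[h], MPI_DOUBLE,
                                nbr_comm, &halo_req[h]);

    /* (purple) local math producing a partial result, then Iallreduce */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, world, &red_req);

    /* (dark purple) more local work, then a manual progress poke */
    MPI_Test(&red_req, &flag, MPI_STATUS_IGNORE);
    (void)flag;   /* the Test is only there to push progress */

    /* (blue)   sparse matrix-vector product on the interior */
    /* spmv_interior(); */

    /* complete the halos one by one; a Waitall on each halo's request set
       keeps the MPI-2 isend/irecv and MPI-3 code paths interchangeable */
    for (int h = 0; h < nhalos; ++h) {
        MPI_Waitall(1, &halo_req[h], MPI_STATUSES_IGNORE);
        /* apply_halo(h, rbuf[h]); */
    }

    /* (last purple) needs the reduction result */
    MPI_Wait(&red_req, MPI_STATUS_IGNORE);
    /* use_global(global); */
}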

Notice that all the halos arrive at almost the same time; in fact, I get the same MPI_Waitall time even if I exchange fewer than 8 halos! This is because the first Waitall progresses all the messages (or so I would like to think...).
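One way I could cross-check this would be to complete all halos with a single MPI_Waitall over the whole request array instead of one Waitall per halo, so that the entire completion cost shows up in one place (halo_req and nhalos as in the placeholder skeleton above):

#include <mpi.h>

/* Complete all outstanding halo exchanges at once; with per-halo Waitall
   calls the first call appears to absorb nearly all of the progress work. */
void complete_all_halos(int nhalos, MPI_Request halo_req[])
{
    MPI_Waitall(nhalos, halo_req, MPI_STATUSES_IGNORE);
}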

Now I switch async progress on. There is an extra line for each PE (T1) for the MPI progress thread, which I colored white so it does not interfere with the important data:

Well, here is the thing I don't understand. On the one hand, the effect of the Iallreduce is really eye-catching here because the synchronization between PEs is less strict, and the MPI_Waitall times (between the halo parts) have decreased considerably. On the other hand, the calls to MPI_Ineighbor_alltoallv are now really time consuming, and I have lost everything I had gained (in fact, the code is slightly slower than in the first example). Because of the inconsistent call durations across PEs, I thought I must be saturating the hardware with all the async progression that has to take place, so I decreased the number of PEs to one per node (a 50% slowdown) while still using 5 threads per PE:

This time it takes a while for a steady state to appear, but the most interesting thing is that I now see increases in the Iallreduce time, and also in MPI_Wait.

Can anyone help me or at least tell me if I am doing something wrong? Did I stumble upon a library issue?

I will happily provide more data should you request it,
Dr Ariel Biller

