Hello,
We've been facing some scalability issues with our MPI-based distributed application, so we decided to write a simple microbenchmark that mimics the same "all-to-all, pairwise" communication pattern, in an attempt to better understand where the performance degradation comes from. Our results indicate that increasing the number of threads severely impacts the overall throughput we can achieve. This was a rather surprising conclusion, so we wanted to check with you whether this is to be expected, or whether it might point to a performance issue in the library itself.
So, what we're trying to do is hash-partition some data from a set of producers to a set of consumers. In our microbenchmark, however, we've completely eliminated all computation, so we're just sending the same data over and over again. We have one or more processes per node, each with t "sender" threads and t "receiver" threads. Below is the pseudocode for the sender and receiver bodies (initialization and finalization code excluded for brevity):
Sender:
for (iter = 0; iter < num_iters; iter++) {
    // pre-process input data, compute hashes, etc. (only in our real application)
    for (dest = 0; dest < num_procs * num_threads; dest++) {
        active_buf = iter % num_bufs;
        MPI_Wait(&req[dest][active_buf], MPI_STATUS_IGNORE);
        // copy relevant data into buf[dest][active_buf] (only in our real application)
        MPI_Issend(buf[dest][active_buf], buf_size, MPI_BYTE, dest / num_threads, dest,
                   MPI_COMM_WORLD, &req[dest][active_buf]);
    }
}
Receiver:
for (iter = 0; iter < num_iters; iter++) {
    MPI_Waitany(num_bufs, req, &active_buf, MPI_STATUS_IGNORE);
    // process data in buf[active_buf] (only in our real application)
    MPI_Irecv(buf[active_buf], buf_size, MPI_BYTE, MPI_ANY_SOURCE, receiver_id,
              MPI_COMM_WORLD, &req[active_buf]);
}
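For reference, the setup we left out is roughly the following (a simplified sketch rather than our exact code; num_dests, rbuf and rreq are just illustrative names, and error checking is omitted):

/* Sketch of the omitted initialization/finalization; assumes <mpi.h> and <stdlib.h>. */
int num_dests = num_procs * num_threads;

/* Per sender thread: one set of buffers and requests per destination.
 * Requests start as MPI_REQUEST_NULL so the first MPI_Wait returns immediately. */
char ***buf = malloc(num_dests * sizeof(char **));
MPI_Request **req = malloc(num_dests * sizeof(MPI_Request *));
for (int dest = 0; dest < num_dests; dest++) {
    buf[dest] = malloc(num_bufs * sizeof(char *));
    req[dest] = malloc(num_bufs * sizeof(MPI_Request));
    for (int b = 0; b < num_bufs; b++) {
        buf[dest][b] = malloc(buf_size);
        req[dest][b] = MPI_REQUEST_NULL;
    }
}

/* Per receiver thread (rbuf/rreq correspond to buf/req in the receiver
 * pseudocode above): pre-post all receive buffers before the main loop. */
char **rbuf = malloc(num_bufs * sizeof(char *));
MPI_Request *rreq = malloc(num_bufs * sizeof(MPI_Request));
for (int b = 0; b < num_bufs; b++) {
    rbuf[b] = malloc(buf_size);
    MPI_Irecv(rbuf[b], buf_size, MPI_BYTE, MPI_ANY_SOURCE,
              receiver_id, MPI_COMM_WORLD, &rreq[b]);
}

/* Finalization, sender side: drain all outstanding sends before MPI_Finalize. */
for (int dest = 0; dest < num_dests; dest++)
    MPI_Waitall(num_bufs, req[dest], MPI_STATUSES_IGNORE);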
For our experiments, we always used the same 6 nodes (2 CPUs x 6 cores, Infiniband interconnect) and kept the following parameters fixed:
- num_iters = 150
- buf_size = 128 KB
- num_bufs = 2, for both senders and receivers (note that senders have 2 buffers per destination, while receivers have 2 buffers in total)
Varying the number of threads and the number of processes per node, we obtained the following results:
total data size   1 process per node    4 threads per process    1 thread per process
--------------------------------------------------------------------------------------
 0.299 GB          4t : 13.088 GB/s      6x1p : 13.498 GB/s       6x 4p : 11.838 GB/s
 1.195 GB          8t : 12.490 GB/s      6x2p : 12.502 GB/s       6x 8p : 10.203 GB/s
 2.689 GB         12t :  6.861 GB/s      6x3p : 12.199 GB/s       6x12p : 10.018 GB/s
 4.781 GB         16t :  4.824 GB/s      6x4p : 11.808 GB/s       6x16p :  9.750 GB/s
 7.471 GB         20t :  3.950 GB/s      6x5p : 11.534 GB/s       6x20p :  9.485 GB/s
10.758 GB         24t :  3.610 GB/s      6x6p :  9.784 GB/s       6x24p :  9.225 GB/s
Notes:
- by t threads we actually mean t sender threads and t receiver threads, so each process consists of 2*t+1 threads in total
- we realize that we're oversubscribing the cores, as our machines have only 12 cores per node, but we didn't expect this to be an issue in our microbenchmark, as we're barely doing any computation and we're completely network bound
- we noticed with "top" that there is always one core with 100% utilization per process
As you can see, there is a significant difference between e.g. running 1 process per node with 20 threads (~4 GB/s) and 5 processes per node with 4 threads each (~11.5 GB/s). It looks to us like the implementation of MPI_THREAD_MULTIPLE uses some shared resources that become very inefficient as contention increases.
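For completeness, we request full threading support at startup along these lines (sketch, with simplified error handling; assumes <mpi.h> and <stdio.h>):

/* Request MPI_THREAD_MULTIPLE and bail out if the library can't provide it. */
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    fprintf(stderr, "MPI_THREAD_MULTIPLE not available (provided level: %d)\n", provided);
    MPI_Abort(MPI_COMM_WORLD, 1);
}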
So our questions are:
- Is this performance degradation to be expected / a known fact? We couldn't find any resources that would indicate so.
- Are our speculations anywhere close to reality? If so, is there a way to increase the number of shared resources in order to decrease contention and maintain the same throughput while still running 1 process per node?
- In general, is there a better, more efficient way we could implement this n-to-n data partitioning? We assume this is quite a common problem.
Any feedback is highly appreciated and we are happy to provide any additional information that you may request.
Regards,
- Adrian