Hello,
We've been facing some scalability issues with our MPI-based distributed application, so we decided to write a simple microbenchmark that mimics the same "all-to-all, pairwise" communication pattern, in an attempt to better understand where the performance degradation comes from. Our results indicate that increasing the number of threads severely impacts the overall throughput we can achieve. This was a rather surprising conclusion, so we wanted to check with you whether this is to be expected, or whether it might point to a performance issue in the library itself.
So, what we're trying to do is hash-partition some data from a set of producers to a set of consumers. In our microbenchmark, however, we've completely eliminated all computation, so we're just sending the same data over and over again. We have one or more processes per node, each with t "sender" threads and t "receiver" threads. Below is the pseudocode for the sender and receiver bodies (initialization and finalization code excluded for brevity):
Sender:
for (iter = 0; iter < num_iters; iter++) {
    // pre-process input data, compute hashes, etc. (only in our real application)
    for (dest = 0; dest < num_procs * num_threads; dest++) {
        active_buf = iter % num_bufs;
        MPI_Wait(&req[dest][active_buf], MPI_STATUS_IGNORE);
        // copy relevant data into buf[dest][active_buf] (only in our real application)
        MPI_Issend(buf[dest][active_buf], buf_size, MPI_BYTE, dest / num_threads, dest,
                   MPI_COMM_WORLD, &req[dest][active_buf]);
    }
}
Receiver:
for (iter = 0; iter < num_iters; iter++) {
    MPI_Waitany(num_bufs, req, &active_buf, MPI_STATUS_IGNORE);
    // process data in buf[active_buf] (only in our real application)
    MPI_Irecv(buf[active_buf], buf_size, MPI_BYTE, MPI_ANY_SOURCE, receiver_id,
              MPI_COMM_WORLD, &req[active_buf]);
}
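For reference, the setup we left out is roughly the following (a simplified sketch rather than our exact code; num_dests, rbuf and rreq are just illustrative names, and error checking is omitted):

/* Sketch of the omitted initialization/finalization; assumes <mpi.h> and <stdlib.h>. */
int num_dests = num_procs * num_threads;

/* Per sender thread: one set of buffers and requests per destination.
 * Requests start as MPI_REQUEST_NULL so the first MPI_Wait returns immediately. */
char ***buf = malloc(num_dests * sizeof(char **));
MPI_Request **req = malloc(num_dests * sizeof(MPI_Request *));
for (int dest = 0; dest < num_dests; dest++) {
    buf[dest] = malloc(num_bufs * sizeof(char *));
    req[dest] = malloc(num_bufs * sizeof(MPI_Request));
    for (int b = 0; b < num_bufs; b++) {
        buf[dest][b] = malloc(buf_size);
        req[dest][b] = MPI_REQUEST_NULL;
    }
}

/* Per receiver thread (rbuf/rreq correspond to buf/req in the receiver
 * pseudocode above): pre-post all receive buffers before the main loop. */
char **rbuf = malloc(num_bufs * sizeof(char *));
MPI_Request *rreq = malloc(num_bufs * sizeof(MPI_Request));
for (int b = 0; b < num_bufs; b++) {
    rbuf[b] = malloc(buf_size);
    MPI_Irecv(rbuf[b], buf_size, MPI_BYTE, MPI_ANY_SOURCE,
              receiver_id, MPI_COMM_WORLD, &rreq[b]);
}

/* Finalization, sender side: drain all outstanding sends before MPI_Finalize. */
for (int dest = 0; dest < num_dests; dest++)
    MPI_Waitall(num_bufs, req[dest], MPI_STATUSES_IGNORE);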
For our experiments, we always used the same 6 nodes (2 CPUs x 6 cores, Infiniband interconnect) and kept the following parameters fixed:
- num_iters = 150
- buf_size = 128 KB
- num_bufs = 2, for both senders and receivers (note that senders have 2 buffers per destination, while receivers have 2 buffers in total)
Varying the number of threads and the number of processes per node, we obtained the following results:
total data size   1 process per node    4 threads per process    1 thread per process
--------------------------------------------------------------------------------------
 0.299 GB          4t : 13.088 GB/s      6x1p : 13.498 GB/s       6x 4p : 11.838 GB/s
 1.195 GB          8t : 12.490 GB/s      6x2p : 12.502 GB/s       6x 8p : 10.203 GB/s
 2.689 GB         12t :  6.861 GB/s      6x3p : 12.199 GB/s       6x12p : 10.018 GB/s
 4.781 GB         16t :  4.824 GB/s      6x4p : 11.808 GB/s       6x16p :  9.750 GB/s
 7.471 GB         20t :  3.950 GB/s      6x5p : 11.534 GB/s       6x20p :  9.485 GB/s
10.758 GB         24t :  3.610 GB/s      6x6p :  9.784 GB/s       6x24p :  9.225 GB/s
Notes:
- by t threads we actually mean t sender threads and t receiver threads, so each process consists of 2*t+1 threads in total
- we realize that we're oversubscribing the cores, as our machines have only 12 cores per node, but we didn't expect this to be an issue in our microbenchmark, as we're barely doing any computation and we're completely network bound
- we noticed with "top" that there is always one core with 100% utilization per process
As you can see, there is a significant difference between e.g. running 1 process per node with 20 threads (~4 GB/s) and 5 processes per node with 4 threads each (~11.5 GB/s). It looks to us like the implementation of MPI_THREAD_MULTIPLE uses some shared resources that become very inefficient as contention increases.
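For completeness, we request full threading support at startup along these lines (sketch, with simplified error handling; assumes <mpi.h> and <stdio.h>):

/* Request MPI_THREAD_MULTIPLE and bail out if the library can't provide it. */
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    fprintf(stderr, "MPI_THREAD_MULTIPLE not available (provided level: %d)\n", provided);
    MPI_Abort(MPI_COMM_WORLD, 1);
}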
So our questions are:
- Is this performance degradation to be expected / a known fact? We couldn't find any resources that would indicate so.
- Are our speculations anywhere close to reality? If so, is there a way to increase the number of shared resources in order to decrease contention and maintain the same throughput while still running 1 process per node?
- In general, is there a better, more efficient way we could implement this n-to-n data partitioning? We assume this is quite a common problem.
Any feedback is highly appreciated and we are happy to provide any additional information that you may request.
Regards,
- Adrian