Hi,
one of our users has reported a problem with MPI_Gatherv in Intel MPI 2017.
The problem is related to the maximum number of irecv requests in flight.
To reproduce the problem we set up a test case (the code is shown below) and ran it with 72 MPI tasks on two nodes, each with two Broadwell processors (18 cores per socket). The inter-node communication fabric is Omni-Path.
At runtime the program crashes with the following error message:
Exhausted 1048576 MQ irecv request descriptors, which
usually indicates a user program error or insufficient request
descriptors (PSM2_MQ_RECVREQS_MAX=1048576)
Setting the environment variable PSM2_MQ_RECVREQS_MAX to a higher value seems to solve the problem.
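For reference, we simply raise the limit in the job environment before launching; the value below is just an example (four times the default), not a tuned setting:

export PSM2_MQ_RECVREQS_MAX=4194304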
Putting an MPI barrier after the gatherv call also solves the problem, although with the side effect of forcing task synchronization at every iteration.
Two questions now arise:
1. Are there any known side effects of setting PSM2_MQ_RECVREQS_MAX to a very large value?
Can it affect the resource requirements of my program, memory usage for example?
2. Alternatively, is there a more robust way to limit the maximum number of irecv requests in flight, so that the program does not fail?
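To make the intent of question 2 clearer, this is a rough sketch of the kind of throttling we have in mind (it reuses the variables from the reproducer below; FLUSH_INTERVAL is just an example value we have not tuned): synchronize every N iterations instead of after every call, so the root cannot fall arbitrarily far behind:

#define FLUSH_INTERVAL 1000  /* example value only */

for (int i = 0; i < iterations; i++) {
    MPI_Gatherv(send_buf, 1, MPI_INT,
                recv_buf, recvcounts, displs, MPI_INT,
                0, MPI_COMM_WORLD);
    /* Periodic synchronization bounds how many gatherv messages can be
       outstanding at the root, without a barrier after every call. */
    if ((i + 1) % FLUSH_INTERVAL == 0) {
        MPI_Barrier(MPI_COMM_WORLD);
    }
}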
Best Regards,
Stefano
Here is the code:
#include "mpi.h" #include <stdio.h> #include <stdlib.h> #include <assert.h> int main(int argc, char **argv) { MPI_Init(&argc, &argv); int size, rank; MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank); int iterations = 100000; int send_buf[1] = {rank}; int *recv_buf = NULL; int *recvcounts = NULL; int *displs = NULL; int recv_buf_size = size; if (rank == 0) { recv_buf = calloc(recv_buf_size, sizeof(*recv_buf)); for (int i = 0; i < recv_buf_size; i++) { recv_buf[i] = -1; } recvcounts = calloc(size, sizeof(*recvcounts)); displs = calloc(size, sizeof(*displs)); for (int i = 0; i < size; i++) { recvcounts[i] = 1; displs[i] = i; } } int ten_percent = iterations / 10; int progress = 0; MPI_Barrier(MPI_COMM_WORLD); for (int i = 0; i < iterations; i++) { if (i >= progress) { if (rank == 0) printf("Starting iteration %d\n", i); progress += ten_percent; } MPI_Gatherv(send_buf, 1, MPI_INT, recv_buf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD); } if (rank == 0) { for (int i = 0; i < recv_buf_size; i++) { assert(recv_buf[i] == i); } } free(recv_buf); free(recvcounts); free(displs); MPI_Finalize(); }