Hi everyone,
We found the following behavior in Intel MPI (5.0.3), with both the Intel compilers and gcc:
In a hybrid OpenMP/MPI program, the performance of MPI_Comm_rank degrades significantly if MPI is initialized with MPI_THREAD_MULTIPLE. The two programs below demonstrate the behavior. They can be compiled with
mpiicpc main.cpp -o test.exe -openmp
mpiicpc main2.cpp -o test_nothreads.exe -openmp
Both executables run a simple parallelized for loop twice. The first time, only an arithmetic operation is performed in each iteration; the second time, there are additional calls to MPI_Comm_rank inside the loop.
test.exe uses MPI_THREAD_MULTIPLE. Here is a typical example of the runtime (with one thread, OMP_NUM_THREADS=1) for the two loops:
MPI_THREAD_MULTIPLE w/o rank: 0.0411851
MPI_THREAD_MULTIPLE w rank: 1.03309
test_nothreads.exe doesn't use MPI_THREAD_MULTIPLE (it calls plain MPI_Init), and we get:
w/o rank: 0.0452909
w rank: 0.181268
The slowdown becomes much more severe with e.g. 16 OpenMP threads (OMP_NUM_THREADS=16):
MPI_THREAD_MULTIPLE w rank: 6.07238
versus
w rank: 0.345186
Using a profiler, we find a spin lock inside MPI_Comm_rank that is responsible for the slowdown.
I understand that with MPI_THREAD_MULTIPLE some locking is needed for MPI operations. However, I do not see why that should apply to MPI_Comm_rank, since I assume it is a purely local operation: in Open MPI, for example, it just returns a member of the communicator struct, namely the stored rank.
Therefore, I would like to understand if this is a known problem or a bug.
All the best, Christoph.
I cannot attach .cpp files for some reason, so here is the code inline:
main2.cpp:
#include <cmath>
#include <iostream>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* args[]) {
    MPI_Init(NULL, NULL);

    long n = 1000000;

    double start = MPI_Wtime();
    double* d = new double[n];
    double* d2 = new double[n];
    for (long i = 0; i < n; i++) d[i] = i;  // initialize, otherwise we read uninitialized memory
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        d2[i] = cos(d[i]) * pow(d[i], 3.0);
    }
    delete[] d;
    delete[] d2;
    double end1 = MPI_Wtime();
    std::cout << "w/o rank: " << end1 - start << std::endl;

    d = new double[n];
    d2 = new double[n];
    for (long i = 0; i < n; i++) d[i] = i;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        int myProcID;
        for (int j = 0; j < 10; j++)  // same arithmetic, plus 10 rank queries per iteration
            MPI_Comm_rank(MPI_COMM_WORLD, &myProcID);
        d2[i] = cos(d[i]) * pow(d[i], 3.0);
    }
    delete[] d;  // was leaked in the original listing
    delete[] d2;
    double end2 = MPI_Wtime();
    std::cout << "w rank: " << end2 - end1 << std::endl;

    MPI_Finalize();
    return 0;
}
main.cpp:
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* args[]) {
    int required = MPI_THREAD_MULTIPLE;
    int provided = 0;
    MPI_Init_thread(NULL, NULL, required, &provided);
    if (provided < required) {
        std::cout << "Error: MPI thread support insufficient! required "
                  << required << " provided " << provided << std::endl;
        abort();
    }

    long n = 1000000;

    double start = MPI_Wtime();
    double* d = new double[n];
    double* d2 = new double[n];
    for (long i = 0; i < n; i++) d[i] = i;  // initialize, otherwise we read uninitialized memory
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        d2[i] = cos(d[i]) * pow(d[i], 3.0);
    }
    delete[] d;
    delete[] d2;
    double end1 = MPI_Wtime();
    std::cout << "MPI_THREAD_MULTIPLE w/o rank: " << end1 - start << std::endl;

    d = new double[n];
    d2 = new double[n];
    for (long i = 0; i < n; i++) d[i] = i;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        int myProcID;
        for (int j = 0; j < 10; j++)  // same arithmetic, plus 10 rank queries per iteration
            MPI_Comm_rank(MPI_COMM_WORLD, &myProcID);
        d2[i] = cos(d[i]) * pow(d[i], 3.0);
    }
    delete[] d;  // was leaked in the original listing
    delete[] d2;
    double end2 = MPI_Wtime();
    std::cout << "MPI_THREAD_MULTIPLE w rank: " << end2 - end1 << std::endl;

    MPI_Finalize();
    return 0;
}