Hi everyone,
We found the following behavior in Intel MPI (5.0.3) with both the Intel compilers and GCC:
In a hybrid OpenMP/MPI program, MPI_Comm_rank becomes much slower if MPI is initialized with MPI_THREAD_MULTIPLE. The two source files at the end of this post show the behavior. They can be compiled with
mpiicpc main.cpp -o test.exe -openmp
mpiicpc main2.cpp -o test_nothreads.exe -openmp
Both executables run a simple parallelized for loop twice. The first time, each of the 1,000,000 iterations performs only an arithmetic operation. The second time, each iteration additionally calls MPI_Comm_rank 10 times.
test.exe uses MPI_THREAD_MULTIPLE. Here are typical runtimes in seconds (with one thread, OMP_NUM_THREADS=1) for the two loops:
MPI_THREAD_MULTIPLE w/o rank: 0.0411851
MPI_THREAD_MULTIPLE w rank: 1.03309
test_nothreads.exe doesn't use MPI_THREAD_MULTIPLE, and we get:
w/o rank: 0.0452909
w rank: 0.181268
The slowdown becomes much more severe with, e.g., 16 OpenMP threads:
MPI_THREAD_MULTIPLE w rank: 6.07238
versus
w rank: 0.345186
Using a profiler, we found that a spin lock in MPI_Comm_rank is responsible for the slowdown.
I understand that with MPI_THREAD_MULTIPLE, MPI operations need some locking. However, I do not see why this should be necessary in MPI_Comm_rank, since I assume it is a purely local operation; e.g. in Open MPI, it internally just returns a member of a struct, namely the rank of the calling process.
Therefore, I would like to understand whether this is a known problem or a bug.
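To make the contrast concrete, here is a minimal sketch of the two ways a rank query could be implemented: one guarded by a spin lock, one as a plain member read. Everything in it (FakeComm, rank_locked, rank_plain, sum_ranks) is a hypothetical stand-in for illustration; it is not the actual code of Intel MPI or Open MPI, whose internals I have not inspected.

```cpp
#include <atomic>

// Hypothetical stand-in for a communicator object (not a real MPI type).
struct FakeComm {
    int rank = 0;
    std::atomic<bool> busy{false};  // spin-lock flag
};

// Variant 1: every query takes a spin lock, as the profiler output suggests.
int rank_locked(FakeComm& c) {
    while (c.busy.exchange(true, std::memory_order_acquire)) {
        // spin until the lock is free
    }
    int r = c.rank;
    c.busy.store(false, std::memory_order_release);
    return r;
}

// Variant 2: the query is a plain read of a member that never changes.
int rank_plain(const FakeComm& c) {
    return c.rank;
}

// Calls `query` once per iteration of an OpenMP loop and sums the results.
// Without OpenMP enabled, the pragma is ignored and the loop runs serially.
template <class Query>
long sum_ranks(FakeComm& c, long calls, Query query) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < calls; ++i)
        total += query(c);
    return total;
}
```

Both variants return the same value, e.g. sum_ranks(c, n, rank_locked) and sum_ranks(c, n, rank_plain) agree. With several threads, only the locked variant forces them to take turns on the same flag, which is the kind of contention that would explain the numbers above.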
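Independently of the answer, a possible workaround is to query the rank once before the parallel loop and reuse the cached value inside it. The following is only a sketch under that assumption: Result and compute_with_cached_rank are hypothetical names, and query_rank stands in for whatever performs the actual query (for example a lambda wrapping MPI_Comm_rank).

```cpp
#include <cmath>
#include <vector>

// Hypothetical result type: the cached rank plus the computed values.
struct Result {
    int rank;
    std::vector<double> values;
};

// `query_rank` is any callable returning the rank; it is called exactly once.
template <class RankFn>
Result compute_with_cached_rank(const std::vector<double>& in, RankFn query_rank) {
    Result r;
    r.rank = query_rank();  // single query, before the parallel region
    r.values.resize(in.size());
    const long n = static_cast<long>(in.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        // Threads may read r.rank here; no rank query happens inside the loop.
        r.values[i] = std::cos(in[i]) * std::pow(in[i], 3.0);
    }
    return r;
}
```

With MPI, the callable could be something like [] { int id; MPI_Comm_rank(MPI_COMM_WORLD, &id); return id; }. This avoids the lock in the hot loop but does not explain why the lock is there in the first place.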
All the best, Christoph.
I cannot attach .cpp files for some reason, so here is the code:
main2.cpp:
#include "mpi.h"
#include "omp.h"
#include <iostream>
#include "math.h"
#include "stdlib.h"
int main(int argc,char* args[])
{
    // Baseline: plain MPI_Init, i.e. MPI_THREAD_MULTIPLE is not requested.
    MPI_Init(NULL,NULL);
    long n=1000000;
    double start = MPI_Wtime();
    // Value-initialize the input so the loops do not read indeterminate values.
    double *d = new double[n]();
    double *d2 = new double[n];
    // Loop 1: arithmetic only.
    #pragma omp parallel for
    for(long i=0;i<n;i++)
    {
        d2[i] = cos(d[i])*pow(d[i],3.0);
    }
    delete[] d;
    delete[] d2;
    double end1 = MPI_Wtime();
    std::cout << "w/o rank: "<< end1-start << std::endl;
    d = new double[n]();
    d2 = new double[n];
    // Loop 2: same arithmetic plus 10 calls to MPI_Comm_rank per iteration.
    #pragma omp parallel for
    for(long i=0;i<n;i++)
    {
        int myProcID;
        for(int j=0;j<10;j++)
            MPI_Comm_rank(MPI_COMM_WORLD,&myProcID);
        d2[i] = cos(d[i])*pow(d[i],3.0);
    }
    double end2 = MPI_Wtime();
    std::cout << "w rank: "<< end2-end1 << std::endl;
    // Free the second pair of arrays after the timing.
    delete[] d;
    delete[] d2;
    MPI_Finalize();
    return 0;
}
main.cpp:
#include "math.h"
#include "mpi.h"
#include "omp.h"
#include <iostream>
#include "stdlib.h"
int main(int argc,char* args[])
{
    // Request full multi-threaded MPI support and stop if it is not provided.
    int required = MPI_THREAD_MULTIPLE;
    int provided = 0;
    MPI_Init_thread(NULL,NULL,required,&provided);
    if(provided!=required)
    {
        std::cout << "Error: MPI thread support insufficient! required "<< required << " provided "<< provided << std::endl;
        MPI_Abort(MPI_COMM_WORLD,1);
    }
    long n=1000000;
    double start = MPI_Wtime();
    // Value-initialize the input so the loops do not read indeterminate values.
    double *d = new double[n]();
    double *d2 = new double[n];
    // Loop 1: arithmetic only.
    #pragma omp parallel for
    for(long i=0;i<n;i++)
    {
        d2[i] = cos(d[i])*pow(d[i],3.0);
    }
    delete[] d;
    delete[] d2;
    double end1 = MPI_Wtime();
    std::cout << "MPI_THREAD_MULTIPLE w/o rank: "<< end1-start << std::endl;
    d = new double[n]();
    d2 = new double[n];
    // Loop 2: same arithmetic plus 10 calls to MPI_Comm_rank per iteration.
    #pragma omp parallel for
    for(long i=0;i<n;i++)
    {
        int myProcID;
        for(int j=0;j<10;j++)
            MPI_Comm_rank(MPI_COMM_WORLD,&myProcID);
        d2[i] = cos(d[i])*pow(d[i],3.0);
    }
    double end2 = MPI_Wtime();
    std::cout << "MPI_THREAD_MULTIPLE w rank: "<< end2-end1 << std::endl;
    // Free the second pair of arrays after the timing.
    delete[] d;
    delete[] d2;
    MPI_Finalize();
    return 0;
}