Channel: Clusters and HPC Technology

MPI_Comm_rank, MPI_THREAD_MULTIPLE, and performance


Hi everyone,

We found the following behavior in Intel MPI (5.0.3), with both the Intel compilers and GCC:

In a hybrid OpenMP/MPI program, MPI_Comm_rank becomes dramatically slower when MPI is initialized with MPI_THREAD_MULTIPLE. Two files demonstrating the behavior are included below; they can be compiled with

mpiicpc main.cpp -o test.exe -openmp
mpiicpc main2.cpp -o test_nothreads.exe -openmp

Both executables run a simple parallel for loop twice: the first pass performs only an arithmetic operation per iteration, while the second pass additionally calls MPI_Comm_rank ten times per iteration.

test.exe uses MPI_THREAD_MULTIPLE. Here is a typical runtime for the two loops with a single thread (OMP_NUM_THREADS=1):

MPI_THREAD_MULTIPLE w/o rank: 0.0411851
MPI_THREAD_MULTIPLE w rank: 1.03309

test_nothreads.exe doesn't use MPI_THREAD_MULTIPLE; there we get:

w/o rank: 0.0452909
w rank: 0.181268

The slowdown becomes far more severe with, e.g., 16 OpenMP threads:

MPI_THREAD_MULTIPLE w rank: 6.07238
versus
w rank: 0.345186

Using a profiler, we find that a spin lock inside MPI_Comm_rank is responsible for the slowdown.

I understand that with MPI_THREAD_MULTIPLE some locking is needed for MPI operations. However, I do not see why this should apply to MPI_Comm_rank, which I assume to be a purely local operation — in Open MPI, for example, it just returns a member of the communicator struct, namely the process rank.
Therefore, I would like to understand whether this is a known problem or a bug.

All the best, Christoph.

I cannot attach .cpp files for some reason, so here is the code inline:

main2.cpp:

#include "mpi.h"
#include "omp.h"

#include <iostream>
#include <cmath>
#include <cstdlib>

int main(int argc, char* args[])
{
	MPI_Init(NULL, NULL);
	long n = 1000000;
	double start = MPI_Wtime();
	// Value-initialize d: reading uninitialized doubles would be
	// undefined behavior.
	double *d = new double[n]();
	double *d2 = new double[n];

	// First pass: arithmetic only.
#pragma omp parallel for
	for (long i = 0; i < n; i++)
	{
		d2[i] = cos(d[i]) * pow(d[i], 3.0);
	}
	delete[] d;
	delete[] d2;

	double end1 = MPI_Wtime();
	std::cout << "w/o rank: " << end1 - start << std::endl;

	d = new double[n]();
	d2 = new double[n];

	// Second pass: same arithmetic plus ten MPI_Comm_rank calls
	// per iteration.
#pragma omp parallel for
	for (long i = 0; i < n; i++)
	{
		int myProcID;
		for (int j = 0; j < 10; j++)
			MPI_Comm_rank(MPI_COMM_WORLD, &myProcID);
		d2[i] = cos(d[i]) * pow(d[i], 3.0);
	}

	double end2 = MPI_Wtime();
	std::cout << "w rank: " << end2 - end1 << std::endl;
	delete[] d;
	delete[] d2;
	MPI_Finalize();

	return 0;
}

 

main.cpp:

#include "mpi.h"
#include "omp.h"

#include <iostream>
#include <cmath>
#include <cstdlib>

int main(int argc, char* args[])
{
	int required = MPI_THREAD_MULTIPLE;
	int provided = 0;
	MPI_Init_thread(NULL, NULL, required, &provided);
	if (provided != required)
	{
		std::cout << "Error: MPI thread support insufficient! required "
		          << required << " provided " << provided << std::endl;
		abort();
	}
	long n = 1000000;
	double start = MPI_Wtime();
	// Value-initialize d: reading uninitialized doubles would be
	// undefined behavior.
	double *d = new double[n]();
	double *d2 = new double[n];

	// First pass: arithmetic only.
#pragma omp parallel for
	for (long i = 0; i < n; i++)
	{
		d2[i] = cos(d[i]) * pow(d[i], 3.0);
	}
	delete[] d;
	delete[] d2;

	double end1 = MPI_Wtime();
	std::cout << "MPI_THREAD_MULTIPLE w/o rank: " << end1 - start << std::endl;

	d = new double[n]();
	d2 = new double[n];

	// Second pass: same arithmetic plus ten MPI_Comm_rank calls
	// per iteration.
#pragma omp parallel for
	for (long i = 0; i < n; i++)
	{
		int myProcID;
		for (int j = 0; j < 10; j++)
			MPI_Comm_rank(MPI_COMM_WORLD, &myProcID);
		d2[i] = cos(d[i]) * pow(d[i], 3.0);
	}

	double end2 = MPI_Wtime();
	std::cout << "MPI_THREAD_MULTIPLE w rank: " << end2 - end1 << std::endl;
	delete[] d;
	delete[] d2;
	MPI_Finalize();

	return 0;
}

 

