Channel: Clusters and HPC Technology

Mpirun is treating -perhost, -ppn, -grr the same: always round-robin

Our cluster has 2 Haswell sockets per node, each with 12 cores (24 cores/node).

Using: intel/15.1.133, impi/5.0.3.048

Regardless of which of the options named in the subject line is used, ranks are always placed in round-robin fashion. The commands are run in a batch job. When the job is submitted as follows, it generates a host file like the one shown after the command:

qsub -l nodes=2:ppn=1 ...

 

tfe02.% cat hostfile
t0728
t0731
tfe02.%

As an aside, "-ordered-output" also appears to be ignored. I understand that ordered output is difficult to achieve; I only wanted it for better readability. Please note that the ranks in the listings below are not printed in rank order.

With "-perhost 2" I was expecting ranks 0 and 1 to be on the same node:

-------------

cat /var/spool/torque/aux//889322.bqs5
s0014
s0015
mpirun -ordered-output -np 4 -perhost 2 ./hello_mpi_c-intel-impi
Hello from rank 01 out of 4; procname = s0015, cpuid = 12
Hello from rank 03 out of 4; procname = s0015, cpuid = 24
Hello from rank 02 out of 4; procname = s0014, cpuid = 0
Hello from rank 00 out of 4; procname = s0014, cpuid = 12
---------

The help output from mpirun indicates that "-perhost" and "-ppn" are equivalent, and "-ppn 2" gives the same round-robin placement:

----------

cat /var/spool/torque/aux//889321.bqs5
s0014
s0015
mpirun -ordered-output -np 4 -ppn 2 ./hello_mpi_c-intel-impi
Hello from rank 00 out of 4; procname = s0014, cpuid = 12
Hello from rank 02 out of 4; procname = s0014, cpuid = 0
Hello from rank 01 out of 4; procname = s0015, cpuid = 12
Hello from rank 03 out of 4; procname = s0015, cpuid = 24

--------

Again, "-grr" output is not what was expected:

----------------

cat /var/spool/torque/aux//889323.bqs5
s0014
s0015
mpirun -ordered-output -np 4 -grr 2 ./hello_mpi_c-intel-impi
Hello from rank 02 out of 4; procname = s0014, cpuid = 2
Hello from rank 00 out of 4; procname = s0014, cpuid = 12
Hello from rank 03 out of 4; procname = s0015, cpuid = 24
Hello from rank 01 out of 4; procname = s0015, cpuid = 12
 

I'm including the code below; it has not been cleaned up :-(

Please ignore the parts that are not relevant.

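#define _GNU_SOURCE         /* must precede every #include so <sched.h> declares sched_getcpu(); see feature_test_macros(7) */
#include <stdio.h>
#include <stdlib.h>         /* malloc(), used in the BUG section */
#include <mpi.h>
#include <sched.h>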

int main(int argc, char **argv)
{
   int ierr, myid, npes;
   int len, i;
   char name[MPI_MAX_PROCESSOR_NAME];

   ierr = MPI_Init(&argc, &argv);
#ifdef MACROTEST
#define MACROTEST 10
#endif
   ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
   ierr = MPI_Comm_size(MPI_COMM_WORLD, &npes);
   ierr = MPI_Get_processor_name( name, &len );

#ifdef SLEEP
   for (i=1; i<1e1150; i++)
     ;
#endif

     printf("Hello from rank %2.2d out of %d; procname = %s, cpuid = %d\n", myid, npes, name, sched_getcpu());

#ifdef MACROTEST
     printf("Test Macro: %d\n", MACROTEST);
#endif
#ifdef BUG
     {
       int* x = (int*)malloc(10 * sizeof(int));
       x[10] = 0;        // problem 1: heap block overrun
       printf("Print something %d\n",x[10]);
     }                    // problem 2: memory leak -- x not freed
#endif

   ierr = MPI_Finalize();

   return 0;
}

 

