Channel: Clusters and HPC Technology

Hybrid OpenMP/MPI doesn't work with the Intel compiler


Greetings,

We provide the full Intel Cluster compiler suite of software on our cluster which uses Torque/Moab. Last week one of my users complained that his Hybrid OpenMP/MPI code wasn't running properly. The OpenMP portion was running great, but the MPI wasn't splitting the job up across nodes. So I dug into it a bit. Sure enough, the job launches on $X nodes but each node gets the full range of work and isn't split up.

To make sure this was not a problem in his code, I slapped together a really basic hello world program with MPI and OpenMP and confirmed the same behaviour. Not only that, but it works just fine if I compile it with GCC instead! Hrm. Well, I hadn't upgraded the Intel tool set in a few months and I knew of at least one update; maybe that was the problem. So I updated all of the toolsets that I have access to. _Everything_ is now up to date (as of yesterday). I tried again and got the exact same results.

$ mpif90 --version
GNU Fortran (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
$ mpiifort --version
ifort (IFORT) 14.0.0 20130728

Well, maybe it is my code. I am more of a sysadmin than a programmer. I found this code snippet out in the wild and tried it: http://www.rcac.purdue.edu/userinfo/resources/common/compile/hybrid_hell...

Compile: mpiifort -openmp -mt_mpi hybrid_hello.f90
Run with the option of two hosts and two OpenMP threads [ `export OMP_NUM_THREADS=2` ] for testing.
Output of run:

SERIAL REGION:     Runhost:node03                           Rank:           0  of            1 ranks, Thread:           0  of            1  threads   hello, world
PARALLEL REGION:   Runhost:node03                           Rank:           0  of            1 ranks, Thread:           0  of            2  threads   hello, world
PARALLEL REGION:   Runhost:node03                           Rank:           0  of            1 ranks, Thread:           1  of            2  threads   hello, world
SERIAL REGION:     Runhost:node03                           Rank:           0  of            1 ranks, Thread:           0  of            1  threads   hello, world
SERIAL REGION:     Runhost:node01                           Rank:           0  of            1 ranks, Thread:           0  of            1  threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0  of            1 ranks, Thread:           0  of            2  threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0  of            1 ranks, Thread:           1  of            2  threads   hello, world
SERIAL REGION:     Runhost:node01                           Rank:           0  of            1 ranks, Thread:           0  of            1  threads   hello, world

I am given two hosts by Torque/Moab and I get two OpenMP threads, but there is only 1 rank! To quote Adam Savage, "Well, there's the problem!" For whatever reason, each node seems to think that it is the only MPI rank. This is pretty much what I had been seeing, but it is much better code than mine, so I feel better about showing its results. :-)
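The symptom above (several hosts, each reporting "Rank 0 of 1") can be checked mechanically. The following is a minimal Python sketch, not part of the original test; the helper names `summarize` and `looks_like_singletons` are made up for illustration, and the two sample strings are abbreviated copies of the Intel and GCC runs shown in this post.

```python
import re

# Matches lines in the hello-world output format shown in this post, e.g.
#   "SERIAL REGION:  Runhost:node03  Rank: 0 of 1 ranks, Thread: 0 of 1 threads hello, world"
LINE_RE = re.compile(
    r"Runhost:\s*(?P<host>\S+)\s+Rank:\s*(?P<rank>\d+)\s+of\s+(?P<size>\d+)\s+ranks"
)


def summarize(output):
    """Return (hosts, ranks, sizes): the distinct values seen in the output."""
    hosts, ranks, sizes = set(), set(), set()
    for line in output.splitlines():
        m = LINE_RE.search(line)
        if m:
            hosts.add(m.group("host"))
            ranks.add(int(m.group("rank")))
            sizes.add(int(m.group("size")))
    return hosts, ranks, sizes


def looks_like_singletons(output):
    """True when more than one host reports being rank 0 of 1."""
    hosts, ranks, sizes = summarize(output)
    return len(hosts) > 1 and ranks == {0} and sizes == {1}


# Abbreviated samples taken from the runs in this post (one line per host).
INTEL_RUN = """\
SERIAL REGION:     Runhost:node03                           Rank:           0  of            1 ranks, Thread:           0  of            1  threads   hello, world
SERIAL REGION:     Runhost:node01                           Rank:           0  of            1 ranks, Thread:           0  of            1  threads   hello, world
"""

GCC_RUN = """\
SERIAL REGION:     Runhost:node01                           Rank:           0  of            2 ranks, Thread:           0  of            1  threads   hello, world
SERIAL REGION:     Runhost:node02                           Rank:           1  of            2 ranks, Thread:           0  of            1  threads   hello, world
"""

if __name__ == "__main__":
    print("Intel build looks like singletons:", looks_like_singletons(INTEL_RUN))
    print("GCC build looks like singletons:  ", looks_like_singletons(GCC_RUN))
```

Feeding the full job output to `looks_like_singletons` gives a quick yes/no on whether every process believes it is alone in its communicator, which is the behaviour described here.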

What happens with GCC?
Compile: mpif90 -lgomp -fopenmp hybrid_hello.f90
Run with the exact same script/submission process as before.
Output of run:

SERIAL REGION:     Runhost:node01                           Rank:           0  of            2 ranks, Thread:           0  of            1  threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0  of            2 ranks, Thread:           0  of            2  threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0  of            2 ranks, Thread:           1  of            2  threads   hello, world
SERIAL REGION:     Runhost:node01                           Rank:           0  of            2 ranks, Thread:           0  of            1  threads   hello, world
SERIAL REGION:     Runhost:node02                           Rank:           1  of            2 ranks, Thread:           0  of            1  threads   hello, world
PARALLEL REGION:   Runhost:node02                           Rank:           1  of            2 ranks, Thread:           0  of            2  threads   hello, world
PARALLEL REGION:   Runhost:node02                           Rank:           1  of            2 ranks, Thread:           1  of            2  threads   hello, world
SERIAL REGION:     Runhost:node02                           Rank:           1  of            2 ranks, Thread:           0  of            1  threads   hello, world

Well, look at that. It runs just fine and as expected with GCC, but the Intel-compiled binary isn't getting the MPI ranks right at all. At this point, I am fairly certain it is an Intel compiler issue. Knowing that, I combed through the boards and info that Intel provides looking for answers. I found a lot, but nothing really jumped out at me until I found this nifty hello world application: http://software.intel.com/en-us/articles/beginning-hybrid-mpiopenmp-deve...

Now for retesting using the Intel provided code. Surely this will run right. After all, the guide uses the same compile options I have been using!

Compile: mpiifort -openmp -mt_mpi hybrid-hello.f90
Run with the option of two hosts and two OpenMP threads [ `export OMP_NUM_THREADS=2` ] for testing.
Output of run:

Hello from thread   0 of   2 in rank   0 of   1 on node01
Hello from thread   1 of   2 in rank   0 of   1 on node01
Hello from thread   0 of   2 in rank   0 of   1 on node03
Hello from thread   1 of   2 in rank   0 of   1 on node03

Not a great start. Same output I have been getting. What happens with GCC?

Compile: mpif90 -lgomp -fopenmp hybrid-hello.f90
Run with the exact same script/submission process as before.
Output of run:

Hello from thread   1 of   2 in rank   0 of   1 on node02
Hello from thread   0 of   2 in rank   0 of   1 on node02
Hello from thread   0 of   2 in rank   1 of   2 on node03
Hello from thread   1 of   2 in rank   1 of   2 on node03

It works with GCC! What? I am now zero for three on the Hybrid OpenMP/MPI problem (well, zero for four if you count the user who brought this to my attention). What else could it be? I wonder if it doesn't like the mpirun that I got with Torque/Moab. I do have access to (and have already installed) the Intel MPI toolsets. Well, instead of using the Torque mpirun, I will try the Intel mpirun!

And....no. Not only does it not have the pernode parameter, but it seems to be missing a few other features as well. I finally got it to run with `mpirun -bynode -np 2 a.out`, because Torque is allocating 2 cores on 2 hosts to the job and I want it to launch *only* one MPI process per host. Anyway, it finally runs...with the exact same output as before. (I am still not convinced that I have this limited version of mpirun from Intel configured with the right options, and I can't use mpiexec because the Intel mpiexec doesn't appear to recognize the Torque/Moab directives.)
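Since two different mpirun commands are in play here (the one that came with Torque/Moab and the one from Intel), one thing worth ruling out is that the launcher and the compiler wrapper come from different MPI installations. A binary started by a launcher from another MPI implementation can come up as independent single-rank processes. The sketch below is only a heuristic under a stated assumption: that a launcher and a wrapper from the same installation resolve to the same directory. The function names `install_dirs` and `same_install` are hypothetical, and the tool names are the ones used in this post.

```python
import os
import shutil


def install_dirs(tools=("mpirun", "mpiexec", "mpif90", "mpiifort")):
    """Map each tool name to the directory it resolves to on PATH, or None."""
    found = {}
    for tool in tools:
        path = shutil.which(tool)
        # realpath follows symlinks so wrapper links point at the real install
        found[tool] = os.path.dirname(os.path.realpath(path)) if path else None
    return found


def same_install(dirs, launcher, wrapper):
    """True only when both tools were found and live in the same directory.

    Assumption: sharing a directory is a rough proxy for "same MPI install".
    """
    launcher_dir = dirs.get(launcher)
    return launcher_dir is not None and launcher_dir == dirs.get(wrapper)


if __name__ == "__main__":
    dirs = install_dirs()
    for tool, directory in dirs.items():
        print(f"{tool:10s} -> {directory}")
    for wrapper in ("mpif90", "mpiifort"):
        print(f"mpirun and {wrapper} share a directory:",
              same_install(dirs, "mpirun", wrapper))
```

Run inside the job script, just before the mpirun line, this shows which mpirun the job actually picks up and whether it sits next to the wrapper used to build a.out. A mismatch does not prove anything by itself, but it narrows down where to look.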

So the question is, what am I doing wrong? I can't seem to get the Intel-compiled version of this code to run in a proper Hybrid OpenMP/MPI configuration. It obviously works with GCC, so I am fairly convinced it isn't the code. The cluster is running great with other Intel MPI jobs using the Torque-provided mpirun, so I don't think it is that (though I might get better performance if I can get the Intel mpirun tweaked and working properly). That really only leaves the Intel compiler, and I am stumped on why it isn't working.

Any help would be greatly appreciated.

Thank you!

