
pmi_proxy stalls the HPC job

Hi HPC enthusiasts,
We have an 8-node Sandy Bridge cluster with the following configuration:

Hardware:
1U rackmount enclosure
Intel S2400SC2 board
2 x Xeon E5-2450 processors
96GB ECC DDR3 RDIMM
Intel True Scale QLE7340-CK HCA
500GB Enterprise SATA
36 port QLogic switch
24-port 1GbE switch

Software:
CentOS 6.2 x64
Intel MPI Library 4.1.1.036
Intel Fortran Composer XE 2013.3.163
NetCDF 4.0
FFTW 3.3.3
Open Grid Engine 2011.11.p1
NFS share
Passphraseless SSH between all machines (full mesh)

Of late, whenever we submit our job (home-grown code), either directly via mpirun or through Grid Engine qsub, it almost always (~90% of the time) fails to start executing and simply appears to be stalled. Inspecting the processes on the nodes, we find that a few nodes, seemingly at random, show a 'pmi_proxy' process with status 'D' (uninterruptible sleep).
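
For reference, this is roughly the check we run on a suspect node (the hostname is just a placeholder for one of our compute nodes):

# List the Hydra helper processes and their scheduler state; 'D' in the STAT
# column means the process is blocked in uninterruptible sleep inside the kernel.
ssh node01 "ps -eo pid,stat,wchan:30,cmd | grep '[p]mi_proxy'"

The wchan column at least shows which kernel function the stuck process is waiting in.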

We have tested IMB (the Intel MPI Benchmarks) and the test codes that ship with Grid Engine and Intel MPI on the cluster, both via mpirun and through qsub, and they all run fine.
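
For context, the benchmark run that works looks roughly like this (the hostfile name and process count are just examples, not our exact command line):

# IMB-MPI1 is the main binary from the Intel MPI Benchmarks suite; the
# hostfile lists our 8 compute nodes. The same command also runs fine via qsub.
mpirun -f ./hostfile -n 16 ./IMB-MPI1 PingPong Sendrecv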

What is the pmi_proxy process, and how can we stop the job from stalling? The non-functioning job is driving me crazy. Please excuse me if this has already been discussed elsewhere, or if this is not the correct forum; I am a novice HPC user.
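
If more detail would help, we can rerun the job with Intel MPI's debug output enabled and post the log, for example (./our_code and the hostfile name are placeholders):

# I_MPI_DEBUG and I_MPI_HYDRA_DEBUG turn on verbose output from the library
# and the Hydra process manager, respectively.
I_MPI_DEBUG=5 I_MPI_HYDRA_DEBUG=1 mpirun -f ./hostfile -n 16 ./our_code 2>&1 | tee run.log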

Any guidance would be appreciated.

Thanks in advance for any suggestions.

With regards
Girish Nair
+91 98457 36460
girishnairisonline <at> gmail <dot> com

