Channel: Clusters and HPC Technology

Intel Trace Collector Crashing with Large Number of Cores


Dear Support,

I am currently running on Red Hat Linux 6.2 (64-bit) with Intel compilers 12.1.0 and Intel MPI 4.0.3.008 over QLogic InfiniBand QDR (PSM). I am also using Intel Trace Analyzer and Collector 8.0.3.007.

I am trying to debug an MPI problem that appears when running on a large number of cores (>6000), so I compile my application with "-check_mpi". My application is a mix of FORTRAN, C, and C++, and most of the MPI calls are in FORTRAN.

I launch my MPI job with the following options:

mpiexec.hydra -env I_MPI_FABRICS tmi -env I_MPI_TMI_PROVIDER psm -env I_MPI_DEBUG 5 .......

As soon as I launch the application, the trace collector crashes with the following error:

[0] Intel(R) Trace Collector ERROR: cannot create socket: socket(): Too many open files
[32] Intel(R) Trace Collector ERROR: connection closed by peer #0, receiving remaining 8 of 8 bytes failed
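
For context, "Too many open files" is the standard message for errno EMFILE, which means a process has reached its per-process limit on open file descriptors (RLIMIT_NOFILE). Below is a minimal sketch for checking that limit on a node, assuming Python is available there. The helper names are hypothetical, and the idea that rank 0 needs roughly one socket per rank is only an assumption drawn from the error above, not confirmed Trace Collector behavior.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits_within_soft_limit(descriptors_needed, soft_limit):
    """Return True if descriptors_needed stays below soft_limit.

    resource.RLIM_INFINITY means the limit is unlimited.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return descriptors_needed < soft_limit


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # Assumption for illustration only: one socket per MPI rank on rank 0.
    ranks = 6000
    print("soft limit:", soft, "hard limit:", hard)
    print("room for", ranks, "sockets:", fits_within_soft_limit(ranks, soft))
```

If the soft limit turns out to be lower than the number of ranks, that would be consistent with the failure only showing up at large core counts.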
 

It works fine on fewer cores, but I need to debug on more than 6000 cores, since that is where my application starts having problems with MPI.

Any suggestions on how to overcome this limitation? Is there a way to have the trace collector run over Infiniband instead of TCP sockets?

Thank you for your help.

Mohamad Sindi

EXPEC Advanced Research Center

Saudi Aramco

 

 

