Hi,
Summary
While testing the scalability of the Quantum Espresso HPC software package, I stumbled on a very strange and annoying problem: when the jobs are run on too many cores, they undergo "sudden death" at some point. "Sudden death" means the job stops with no error message at all and no core dump. "Too many" and "some point" mean: if the job is run with parallelization parameters above a given limit, it stops during a given cycle; the higher the parameters, the sooner it stops. The mpirun -np option is the most influential parameter.
Details
I'm compiling and running the PWscf software from Quantum Espresso 6.2.1. The server is a single node with 4 sockets, each equipped with an "Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz" and, initially, one 16 GB DIMM on channel 0.
Initially I used Parallel Studio 2019 Initial Release to compile and run, with only Intel MPI parallelization (no threading). Years ago we added these variables to the starting script: OMPI_MCA_mpi_yield_when_idle=1, OMPI_MCA_mpi_paffinity_alone=1, OMP_NUM_THREADS=1 (the OMPI_MCA_* ones are Open MPI settings, so I guess they're irrelevant with Intel MPI, but I mention them for completeness).
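For reference, the relevant lines of the starting script are essentially:

    export OMPI_MCA_mpi_yield_when_idle=1   # leftover Open MPI setting
    export OMPI_MCA_mpi_paffinity_alone=1   # leftover Open MPI setting
    export OMP_NUM_THREADS=1                # pure MPI run, no OpenMP threading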
To ensure isolation of the PWscf tasks from other software running on the node, I use the "cset" package tools (Linux Mint 19, a clone of Ubuntu 18.04). Sockets 1-3 are devoted to PWscf, socket 0 to the OS and other software. All tests are done with a multiple-of-3 number N of tasks, globally tied to N cores evenly distributed over sockets 1-3 (I don't do anything special to bind each task to a specific core).
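Schematically, the shielding looks like this (the CPU numbers are only an example; the actual mapping of CPU IDs to sockets depends on how the node enumerates its cores, here I assume socket 0 = CPUs 0-23 and sockets 1-3 = CPUs 24-95):

    # reserve sockets 1-3 for PWscf, leaving socket 0 to the OS and other software
    sudo cset shield --cpu 24-95 --kthread=on
    # the jobs are then launched inside the shield
    sudo cset shield --exec <command> -- <args>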
I have a reproducible test case: repeating a few runs showed that, for jobs that complete, the running time is reproducible to within approx. ±10 s (it ranges from approx. 1h30 to 3h depending on N). For failing jobs, the failure always occurs in the same cycle, ±1, for given values of the -np mpirun option and of the -ndiag PWscf option.
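A typical run looks like this (the input file name and the particular -np/-ndiag values are only examples):

    mpirun -np 54 pw.x -ndiag 36 -input scf.in > scf.out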
Test results
First series: all jobs with -np ≥ 51 fail, in the 16th or 17th cycle, irrespective of -ndiag.
We then upgraded the node to 3 DIMMs of 16, 8, and 8 GB on channels 0, 1, and 2 of each socket.
Second series: timings are better (confirming our hypothesis that memory bandwidth was limiting performance and that our provider misconfigured the node by populating only 1 DIMM per socket). Jobs complete with -np values up to 54 or 57, depending on the -ndiag value. Failing jobs fail sooner when -np is higher: e.g. -np 57 leads to failure in the 59th cycle, -np 72 in the 44th cycle (for a given -ndiag).
Since adding memory raised the -np limit, the problem seems related to the amount of available memory.
I then thought I'd try an updated Parallel Studio, but stumbled on this bug I reported. With the workaround suggested there (I_MPI_HYDRA_TOPOLIB=ipl), I ran a third series with PWscf compiled with PS 2019 Update 3: things are worse, the program outputs nothing at all, although the tasks use 100% CPU!
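The workaround is just an extra environment variable in the starting script, something like:

    export I_MPI_HYDRA_TOPOLIB=ipl   # workaround from the bug report above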
Fourth series: PWscf compiled with PS 2019 Update 1, with I_MPI_HYDRA_TOPOLIB=ipl: the program runs as usual, but the sudden deaths are still there and now look quite unpredictable (yet still reproducible): -np 33 fails but -np 57 (or 60) completes (for a given -ndiag). I haven't done extensive tests for all -np values, and I'm not very inclined to do so since this series seems to have worse outcomes.
Please advise on what to do next. Thanks.