Hi,
Summary
While testing the scalability of the Quantum Espresso HPC software package, I stumbled on a very strange and annoying problem: when the jobs are run on too many cores, they undergo "sudden death" at some point. "Sudden death" means the job stops with no error message at all and no core dump. "Too many" and "some point" mean: if the job is run with parallelization parameters above a given limit, it stops during a given cycle; the higher the parameters, the sooner it stops. The mpirun -np option is the most influential parameter.
Details
I'm compiling and running the PWscf software from Quantum Espresso 6.2.1. The server is a single node with 4 sockets, each equipped with an "Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz" and, initially, one 16 GB DIMM on channel 0.
Initially I used Parallel Studio 2019 Initial Release to compile and run, with only Intel MPI parallelization (no threading). Years ago we added these variables to the starting script: OMPI_MCA_mpi_yield_when_idle=1, OMPI_MCA_mpi_paffinity_alone=1, OMP_NUM_THREADS=1 (the OMPI_MCA_* ones are Open MPI settings, so I guess they're irrelevant with Intel MPI, but I mention them for completeness).
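For reference, the relevant lines of the starting script are essentially:

    export OMPI_MCA_mpi_yield_when_idle=1   # leftover Open MPI setting
    export OMPI_MCA_mpi_paffinity_alone=1   # leftover Open MPI setting
    export OMP_NUM_THREADS=1                # pure MPI run, no OpenMP threading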
To ensure isolation of the PWscf tasks from other software running on the node, I use the "cset" package tools (Linux Mint 19, a clone of Ubuntu 18.04). Sockets 1-3 are devoted to PWscf, socket 0 to the OS and other software. All tests are done with a multiple-of-3 number N of tasks, globally tied to N cores evenly distributed over sockets 1-3 (I don't do anything special to bind each task to a specific core).
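Schematically, the shielding looks like this (the CPU numbers are only an example; the actual mapping of CPU IDs to sockets depends on how the node enumerates its cores, here I assume socket 0 = CPUs 0-23 and sockets 1-3 = CPUs 24-95):

    # reserve sockets 1-3 for PWscf, leaving socket 0 to the OS and other software
    sudo cset shield --cpu 24-95 --kthread=on
    # the jobs are then launched inside the shield
    sudo cset shield --exec <command> -- <args>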
I have a reproducible test case: repeating a few runs showed that, for jobs that complete, the running time is reproducible to within approx. ±10 s (it ranges from approx. 1h30 to 3h depending on N). For failing jobs, the failure always occurs in the same cycle, ±1, for given values of the -np mpirun option and of the -ndiag PWscf option.
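A typical run looks like this (the input file name and the particular -np/-ndiag values are only examples):

    mpirun -np 54 pw.x -ndiag 36 -input scf.in > scf.out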
Test results
First series: all jobs with -np ≥ 51 fail, in the 16th or 17th cycle, irrespective of -ndiag.
We then upgraded the node to 3 DIMMs of 16, 8, and 8 GB on channels 0, 1, and 2 of each socket.
Second series: timings are better (confirming our hypothesis that memory bandwidth was limiting performance and that our provider misconfigured the node by populating only 1 DIMM per socket). Jobs complete with -np values up to 54 or 57, depending on the -ndiag value. Failing jobs fail sooner when -np is higher: e.g. -np 57 leads to failure in the 59th cycle, -np 72 in the 44th cycle (for a given -ndiag).
Since adding memory raised the -np limit, the problem seems related to the amount of available memory.
I then thought I'd try an updated Parallel Studio, but stumbled on this bug I reported. With the workaround suggested there (I_MPI_HYDRA_TOPOLIB=ipl), I ran a third series with PWscf compiled with PS 2019 Update 3: things are worse, the program outputs nothing at all, although the tasks use 100% CPU!
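The workaround is just an extra environment variable in the starting script, something like:

    export I_MPI_HYDRA_TOPOLIB=ipl   # workaround from the bug report above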
Fourth series: PWscf compiled with PS 2019 Update 1, with I_MPI_HYDRA_TOPOLIB=ipl: the program runs as usual, but the sudden deaths are still there and now look quite unpredictable (yet still reproducible): -np 33 fails but -np 57 (or 60) completes (for a given -ndiag). I haven't done extensive tests for all -np values, and I'm not very inclined to do so since this series seems to have worse outcomes.
Please advise on what to do next. Thanks.