
Floating Point Exception Overflow and Open MPI equivalent tuning?


I am working with STAR-CCM+ 2019.1.1 Build 14.02.012
CentOS 7.6, kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with STAR-CCM+)
Cisco UCS cluster using the usNIC fabric over 10 GbE
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 cores

enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
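
For a quick sanity check that the usNIC devices are actually exposed through the verbs layer, the stock libibverbs utilities can be used (just a sketch, assuming the ibverbs utilities are installed):

ibv_devices    # should list one usnic_N device per usNIC-enabled VIC interface
ibv_devinfo    # per-device details; the usnic devices should appear here as well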

On runs of less than 5 hours, everything works flawlessly and is quite fast.

However, when running with 280 cores, the longer jobs die with a floating point exception at or around 5 hours into the job.
The same job completes fine with 140 cores, but takes about 14 hours to finish.
I am also using PBS Pro with a 99-hour wall time.
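
For reference, the resource request looks roughly like this under PBS Pro (a sketch assuming 40 cores per node, 7 x 40 = 280; the exact select syntax may differ on your site):

#PBS -l select=7:ncpus=40:mpiprocs=40
#PBS -l walltime=99:00:00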

------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow].  The specific cause cannot be identified.  Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
   error: Server Error
------------------

I have been doing some reading, and some reports suggest that other MPI implementations are more stable with STAR-CCM+.

I have not ruled out that I am missing some parameters or tuning with Intel MPI as this is a new cluster.

I am also trying to make Open MPI work.  I have Open MPI compiled and it runs, but only with a very small number of CPUs.  Anything over about 2 cores per node hangs indefinitely.

I compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because that is the version my STAR-CCM+ release supports.  I am telling STAR-CCM+ to use the Open MPI that I installed so it can use the Cisco usNIC fabric, which I can verify with Cisco's native tools.  Note that STAR-CCM+ also ships with its own Open MPI; however, I am pointing it at my own build.
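
To confirm that Open MPI is really selecting the usnic BTL rather than silently falling back to TCP, the BTL list can be forced and BTL verbosity turned up; a minimal sketch with a standalone mpirun outside of STAR-CCM+ (vader is the Open MPI 3.x shared-memory BTL, and the test binary is just a placeholder):

mpirun --mca btl usnic,vader,self \
       --mca btl_base_verbose 100 \
       -np 8 --map-by node ./any_small_mpi_test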

I am thinking that I need to tune Open MPI, which was also required with Intel MPI.

With Intel MPI, jobs with more than about 100 cores would hang until I added these parameters:

reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-a...

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647

After adding these parameters I can scale to 280 cores and it runs very fast, right up until the point where it hits the floating point exception.
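
It is also worth noting which fabric Intel MPI actually selects at scale; turning up its debug output is a standard way to see that (a sketch, and the shm:dapl value is an assumption based on the DAPL UD settings above):

export I_MPI_DEBUG=5            # prints the chosen fabric/provider and rank placement at startup
export I_MPI_FABRICS=shm:dapl   # assumption: shared memory intra-node, DAPL inter-node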

I am banging my head against a wall trying to find equivalent tuning parameters for Open MPI.

I have listed all of the MCA parameters available in Open MPI and have tried setting the following, with no success:

btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
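
For reference, Open MPI reads these from (in order of precedence) the mpirun command line, OMPI_MCA_* environment variables, or an MCA parameter file; a sketch of the common ways to apply the values above:

# per-user parameter file, read automatically by mpirun:
#   $HOME/.openmpi/mca-params.conf
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208

# or per run / per job:
#   mpirun --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 ...
#   export OMPI_MCA_btl_usnic_sd_num=8208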

Does anyone have any advice or ideas for:

1.) The floating point overflow issue
and   
2.)  Equivalent tuning parameters for Open MPI

Many thanks in advance

