Hi,
We found that when wait mode is used with the shm:ofa fabrics, the processes of an MPI program use more memory than with other configurations. In some cases the program crashes from memory exhaustion. The issue is reproducible with a couple of programs, including WRF and CMAQ. We also tried to reproduce it with Intel's HPL benchmark. HPL did not crash, but it printed similar warning messages, as follows:
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 100000
NB : 168
PMAP : Row-major process mapping
P : 24
Q : 1
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
[...]
Column=057624 Fraction=0.575 Mflops=296889.01
Column=059640 Fraction=0.595 Mflops=295321.12
Column=061656 Fraction=0.615 Mflops=293425.73
[15] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[3] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[18] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[0] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
Column=063504 Fraction=0.635 Mflops=291756.47
[6] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[12] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[21] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[13] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[4] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[9] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[1] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[22] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[7] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[10] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[14] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[5] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[11] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[2] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[16] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[8] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[19] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[23] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[17] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
Column=065520 Fraction=0.655 Mflops=290125.43
[20] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
[...]
The test was run on CentOS 6.7 with Intel MPI 5.0.2.
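In case it helps anyone triaging a similar log, here is a minimal Python sketch that counts how many "pool will be extended" warnings each rank printed. The function name `count_pool_warnings` and the sample lines are just illustrative (the sample lines are copied from the output above), and the pattern assumes the warning format stays exactly as shown in this log.

```python
import re
from collections import Counter

# Matches lines such as:
# [15] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended
# The format is taken from the log in this post; other versions may print it differently.
WARNING_RE = re.compile(
    r"^\[(\d+)\] MPIU_Handle_indirect_init\(\): "
    r"indirect_size \d+ exceeds indirect_max_size \d+"
)

def count_pool_warnings(lines):
    """Return a Counter mapping MPI rank -> number of pool-extension warnings."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Small sample copied from the HPL output in this post.
sample = [
    "Column=061656 Fraction=0.615 Mflops=293425.73",
    "[15] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended",
    "[3] MPIU_Handle_indirect_init(): indirect_size 1024 exceeds indirect_max_size 1024. pool will be extended",
]

counts = count_pool_warnings(sample)
for rank in sorted(counts):
    print(f"rank {rank}: {counts[rank]} warning(s)")
```

Running it over the full log, for example by passing an open file object instead of `sample`, shows whether every rank hits the warning and how often. In the excerpt above, each of the 24 ranks (0 to 23) prints it once.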
Thank you very much.
Regards,
tofu