I am using the Intel MPI Benchmarks (IMB) to measure the performance of MPI_Allreduce across a rack of 25 machines connected by a 40 Gb/s InfiniBand switch. (I use the latest version of Parallel Studio 2017 on CentOS 7 with Linux kernel 3.10.)
mpiexec.hydra -genvall -n 25 -machinefile ./machines ~/bin/IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000,128
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Mon Feb 20 16:40:26 2017
# Machine : x86_64
# System : Linux
# Release : 3.10.0-327.el7.x86_64
# Version : #1 SMP Thu Nov 19 22:10:57 UTC 2015
# MPI Version : 3.0
...
# /home/syko/Turbograph-DIST/linux_ver/bin//IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000
#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 536870912
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Allreduce
#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 25
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.12 0.18 0.15
67108864 2 298859.48 340774.54 329296.10
134217728 1 619451.05 727700.95 687140.46
268435456 1 1104426.86 1215415.00 1177512.81
536870912 1 2217355.97 2396162.03 2331228.14
# All processes entering MPI_Finalize
So I conclude that the throughput of MPI_Allreduce is about (# of bytes / sizeof(float) / elapsed time) ~= 57 mega-elements/sec.
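That number can be reproduced from the t_avg column of the largest row (536870912 bytes, with sizeof(float) == 4), e.g. with a quick awk one-liner:

```shell
# Element throughput for the 536870912-byte row of the table above:
# bytes / sizeof(float) / elapsed seconds (t_avg is reported in usec).
bytes=536870912
t_avg_usec=2331228.14
awk -v b="$bytes" -v t="$t_avg_usec" \
    'BEGIN { printf "%.1f Melements/sec\n", b / 4 / (t / 1e6) / 1e6 }'
```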
This throughput is far below what I expected. The network bandwidth usage is also much lower than the peak bandwidth of the InfiniBand link.
Is this performance acceptable for Intel MPI, or is there something I can do to improve it?
(I tried varying 'I_MPI_ADJUST_ALLREDUCE', but the results were not satisfying.)
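For reference, this is roughly how I swept the algorithm selector (a sketch of my invocation; the set of valid algorithm IDs depends on the Intel MPI version, so check the I_MPI_ADJUST documentation for your release):

```shell
# Sweep Intel MPI's Allreduce algorithm selector and rerun the benchmark.
# IDs 1..9 select different implementations (recursive doubling,
# Rabenseifner, reduce+bcast, ring, ...); the exact list is version-dependent.
for alg in 1 2 3 4 5 6 7 8 9; do
    echo "== I_MPI_ADJUST_ALLREDUCE=$alg =="
    mpiexec.hydra -genvall -genv I_MPI_ADJUST_ALLREDUCE "$alg" \
        -n 25 -machinefile ./machines \
        ~/bin/IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000,128
done
```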