
MPI_Allreduce is toooo slow


I am using the Intel MPI Benchmarks (IMB) to measure the performance of MPI_Allreduce across a rack of 25 machines connected through a 40 Gb/s InfiniBand switch. (I use the latest version of Parallel Studio 2017 on CentOS 7 with Linux kernel 3.10.)

mpiexec.hydra -genvall -n 25 -machinefile ./machines ~/bin/IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000,128

#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part
#------------------------------------------------------------
# Date                  : Mon Feb 20 16:40:26 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.el7.x86_64
# Version               : #1 SMP Thu Nov 19 22:10:57 UTC 2015
# MPI Version           : 3.0

...

# /home/syko/Turbograph-DIST/linux_ver/bin//IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000
#

# Minimum message length in bytes:   0
# Maximum message length in bytes:   536870912
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Allreduce

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 25
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.12         0.18         0.15
     67108864            2    298859.48    340774.54    329296.10
    134217728            1    619451.05    727700.95    687140.46
    268435456            1   1104426.86   1215415.00   1177512.81
    536870912            1   2217355.97   2396162.03   2331228.14

# All processes entering MPI_Finalize

So I conclude that the throughput of MPI_Allreduce is about (# of bytes / sizeof(float)) / elapsed time ≈ 57 mega-elements/sec.
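Concretely, taking the 536870912-byte row and the t_avg column:

536870912 bytes / 4 bytes per float = 134217728 elements
134217728 elements / 2.331228 sec ≈ 57.6 million elements/sec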

This throughput is far below what I expected, and the network bandwidth usage is also much lower than the peak bandwidth of InfiniBand.
Is this performance acceptable for Intel MPI? If not, is there something I can do to improve it?
(I tried varying 'I_MPI_ADJUST_ALLREDUCE', but was not satisfied with the results.)
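For reference, such a sweep can be scripted with the same launch command. This is only a sketch: the algorithm ID range 1..9 is an assumption based on the Intel MPI 2017 reference, so verify it against the documentation of the installed version.

# Sweep the Allreduce algorithm selection and rerun the benchmark.
# The valid ID range (assumed here to be 1..9) should be checked against
# the I_MPI_ADJUST_ALLREDUCE entry in the Intel MPI reference manual.
for alg in $(seq 1 9); do
  echo "=== I_MPI_ADJUST_ALLREDUCE=$alg ==="
  mpiexec.hydra -genvall -genv I_MPI_ADJUST_ALLREDUCE $alg \
    -n 25 -machinefile ./machines \
    ~/bin/IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000,128
done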
