Intel MPI real-time performance problem

Hello:
In my case, I found that MPI_Gather() sometimes takes 5000-40000 CPU cycles, but normally it takes only about 2000 CPU cycles.

I can confirm that there is no timer interrupt or any other interrupt disturbing MPI_Gather(). I have also tried mlockall() and replaced i_malloc/i_free with my own malloc(), but that did not help.
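For reference, the memory locking I tried is roughly the following (a minimal sketch of my setup code; the real program calls this once at startup, before the timing loop):

    #include <sys/mman.h>
    #include <stdio.h>

    /* Lock all current and future pages into RAM so that page faults
       cannot stall the timing loop later on. */
    static int lock_memory(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return -1;
        }
        return 0;
    }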

When my program calls MPI_Gather(), it needs the call to return within a deterministic amount of time. Is there anything I can do to improve the real-time behavior of Intel MPI, or is there a tool that can help me find out why this function sometimes takes so long?

OS: Linux 3.10

cpuinfo: I boot with isolcpus, so the cpuinfo command may report wrong information; the machine is actually 8 cores / 16 threads.

Intel(R) processor family information utility, Version 4.1 Update 3 Build 20140124
Copyright (C) 2005-2014 Intel Corporation.  All rights reserved.
=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R)  E5-2660 0 
Packages(sockets) : 1
Cores             : 1
Processors(CPUs)  : 1
Cores per package : 1
Threads per core  : 1
=====  Processor identification  =====
Processor       Thread Id.      Core Id.        Package Id.
0               0               0               0   
=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0               0
=====  Cache sharing  =====
Cache   Size            Processors
L1      32  KB          no sharing
L2      256 KB          no sharing
L3      20  MB          no sharing

The test program looks like this:
...
        t1 = rdtsc;                                   /* read the cycle counter before the collective */
        Ierr = MPI_Bcast(&nn, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
        t2 = rdtsc;                                   /* read the cycle counter after the collective  */
        if (t2 > t1 && (t2 - t1) / 1000 > 5)          /* record calls that took more than 5000 cycles */
            record_it(t2 - t1);
...
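For completeness, here is the timing loop as a self-contained sketch (a minimal version under some assumptions: __rdtsc() from <x86intrin.h> stands in for my rdtsc macro, record_it() is my own histogram helper sketched at the end of this post, and every measurement is recorded so the per-core counts add up to the 1000000 steps):

    #include <mpi.h>
    #include <x86intrin.h>                      /* __rdtsc() */

    void record_it(unsigned long long cycles);  /* my histogram helper, see below */

    int main(int argc, char **argv)
    {
        int nn = 0;
        long step;

        MPI_Init(&argc, &argv);

        for (step = 0; step < 1000000; step++) {
            unsigned long long t1 = __rdtsc();
            MPI_Bcast(&nn, 1, MPI_INT, 0, MPI_COMM_WORLD);
            unsigned long long t2 = __rdtsc();

            if (t2 > t1)
                record_it(t2 - t1);             /* bucket the elapsed cycles */
        }

        MPI_Finalize();
        return 0;
    }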

 

mpirun -n 4 -env I_MPI_DEBUG=4 ./my_test

[0] MPI startup(): Intel(R) MPI Library, Version 4.1 Update 3  Build 20140124
[0] MPI startup(): Copyright (C) 2003-2014 Intel Corporation.  All rights reserved.
[1] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 1: 1-1 & 0-2147483647
[0] MPI startup(): Allgather: 4: 2-4 & 0-2147483647
[0] MPI startup(): Allgather: 1: 5-10 & 0-2147483647
[0] MPI startup(): Allgather: 4: 11-22 & 0-2147483647
[0] MPI startup(): Allgather: 1: 23-469 & 0-2147483647
[0] MPI startup(): Allgather: 4: 470-544 & 0-2147483647
[0] MPI startup(): Allgather: 1: 545-3723 & 0-2147483647
[0] MPI startup(): Allgather: 3: 3724-59648 & 0-2147483647
[0] MPI startup(): Allgather: 1: 59649-3835119 & 0-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 0-1942 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 1942-128426 & 0-2147483647
[0] MPI startup(): Allgatherv: 4: 128426-193594 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 193594-454523 & 0-2147483647
[0] MPI startup(): Allgatherv: 4: 454523-561981 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-6 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 6-13 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 13-37 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 37-104 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 104-409 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 409-5708 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 5708-12660 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 12660-61166 & 0-2147483647
[0] MPI startup(): Allreduce: 6: 61166-74718 & 0-2147483647
[0] MPI startup(): Allreduce: 8: 74718-163640 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 163640-355186 & 0-2147483647
[0] MPI startup(): Allreduce: 6: 355186-665233 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-1 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 2-2 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 3-25 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 26-48 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 49-1826 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 1827-947308 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 947309-1143512 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 1143513-3715953 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 1-2045 & 0-2147483647
[0] MPI startup(): Gather: 2: 2046-3072 & 0-2147483647
[0] MPI startup(): Gather: 3: 3073-313882 & 0-2147483647
[0] MPI startup(): Gather: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-5 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 5-162 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 3: 162-81985 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 2: 81985-690794 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 1: 4-11458 & 0-2147483647
[0] MPI startup(): Reduce: 5: 11459-22008 & 0-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 1-24575 & 0-2147483647
[0] MPI startup(): Scatter: 2: 24576-37809 & 0-2147483647
[0] MPI startup(): Scatter: 3: 37810-107941 & 0-2147483647
[0] MPI startup(): Scatter: 2: 107942-399769 & 0-2147483647
[0] MPI startup(): Scatter: 3: 399770-2150807 & 0-2147483647
[0] MPI startup(): Scatter: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 & 0-2147483647
[1] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[3] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
mrloop is using core 3
mrloop is using core 5
[2] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
mrloop is using core 4
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       3477     zte        0
[0] MPI startup(): 1       3478     zte        0
[0] MPI startup(): 2       3479     zte        0
[0] MPI startup(): 3       3480     zte        0
[0] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=8
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 0,2 0,3 0

 core 2: Now the following data is the statistics in 1000000 step:
         3:997331  4:2168   5:1  46:1 
core 3: Now the following data is the statistics in 1000000 step:
         1:4497  2:995003  3:1 
core 4: Now the following data is the statistics in 1000000 step:
         1:2767  2:996733  16:1 
core 5: Now the following data is the statistics in 1000000 step:
         1:1  2:999070  3:430 

 

"3:997331" means that 997331 times the call took about 3000 CPU cycles.

"4:2168" means that 2168 times it took about 4000 CPU cycles.

As you can see, one call took 46000 CPU cycles, while in the other 999501 iterations it took only 1000-3000 CPU cycles.
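In case the format above is unclear, record_it() simply buckets the elapsed cycles by thousands, roughly like this (a minimal sketch of my helper; the real one also prints the per-core statistics shown above):

    #define MAX_BUCKET 1024

    static unsigned long long histogram[MAX_BUCKET];

    /* Bucket an elapsed-cycle measurement by thousands:
       3000-3999 cycles -> bucket 3, 46000-46999 cycles -> bucket 46, ... */
    void record_it(unsigned long long cycles)
    {
        unsigned long long bucket = cycles / 1000;
        if (bucket >= MAX_BUCKET)
            bucket = MAX_BUCKET - 1;
        histogram[bucket]++;
    }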
