Channel: Clusters and HPC Technology

Problem vectorizing FORTRAN code


Hi all,

My original code had some problems vectorizing well. Following suggestions from several places (including several Intel articles), I decided that restructuring the data layout was the most promising approach.

 Instead of defining an array of complex matrices:

 

 type su3
     complex (kind=8) :: comp(3,3)
  end type

  type field
    type (su3), allocatable :: v(:)
  end type

  which makes vectorization complicated (how does one vectorize the multiplication of a single 3x3 matrix?), I have decided to invert the order of the layout:

 type grid_complex
    complex (kind=8), allocatable :: v(:)
  end type

  type field
    type (grid_complex) :: comp(3,3)
  end type

   (i.e. an array of structures, AoS, versus a structure of arrays, SoA). This approach works for me because I *always* have to multiply O(1000) matrices at the same time. The idea was that this typical operation can now be coded in a routine like the following:

 module subroutine su3xsu3(g,a,b)
    ! g = a*b for all matrices at once: each %v holds one matrix
    ! entry for every matrix, so the array expression runs over v(:).
    type (field), intent (in) :: a,b
    type (field), intent (inout) :: g
    integer :: i, j

    do concurrent (i=1:3,j=1:3)
       g%comp(i,j)%v = &
            a%comp(i,1)%v*b%comp(1,j)%v + &
            a%comp(i,2)%v*b%comp(2,j)%v + &
            a%comp(i,3)%v*b%comp(3,j)%v
    end do

  end subroutine su3xsu3
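
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the SoA layout and the routine above, checked against the intrinsic matmul one matrix at a time. The names soa_demo, field_alloc, check_soa, and nmat are made up for this example and are not from my actual code; kind=8 is replaced by real64 only for portability.

```fortran
module soa_demo
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none

  type grid_complex
     complex(dp), allocatable :: v(:)   ! one matrix entry, for all matrices
  end type grid_complex

  type field
     type(grid_complex) :: comp(3,3)    ! SoA: nine long arrays
  end type field

contains

  ! Hypothetical helper: allocate all nine component arrays with length n.
  subroutine field_alloc(f, n)
    type(field), intent(out) :: f
    integer, intent(in) :: n
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          allocate(f%comp(i,j)%v(n))
          f%comp(i,j)%v = (0.0_dp, 0.0_dp)
       end do
    end do
  end subroutine field_alloc

  ! g = a*b for every matrix; the array expression runs over v(:).
  subroutine su3xsu3(g, a, b)
    type(field), intent(in)    :: a, b
    type(field), intent(inout) :: g
    integer :: i, j
    do concurrent (i=1:3, j=1:3)
       g%comp(i,j)%v = a%comp(i,1)%v*b%comp(1,j)%v + &
                       a%comp(i,2)%v*b%comp(2,j)%v + &
                       a%comp(i,3)%v*b%comp(3,j)%v
    end do
  end subroutine su3xsu3

end module soa_demo

program check_soa
  use soa_demo
  implicit none
  integer, parameter :: nmat = 1000      ! assumed number of matrices
  type(field) :: a, b, g
  complex(dp) :: ma(3,3), mb(3,3), mg(3,3)
  real(dp) :: re(nmat), im(nmat), err
  integer :: i, j, k

  call field_alloc(a, nmat)
  call field_alloc(b, nmat)
  call field_alloc(g, nmat)

  ! Fill a and b with random complex entries.
  do j = 1, 3
     do i = 1, 3
        call random_number(re)
        call random_number(im)
        a%comp(i,j)%v = cmplx(re, im, kind=dp)
        call random_number(re)
        call random_number(im)
        b%comp(i,j)%v = cmplx(re, im, kind=dp)
     end do
  end do

  call su3xsu3(g, a, b)

  ! Gather each 3x3 matrix and compare with the intrinsic matmul.
  err = 0.0_dp
  do k = 1, nmat
     do j = 1, 3
        do i = 1, 3
           ma(i,j) = a%comp(i,j)%v(k)
           mb(i,j) = b%comp(i,j)%v(k)
           mg(i,j) = g%comp(i,j)%v(k)
        end do
     end do
     err = max(err, maxval(abs(mg - matmul(ma, mb))))
  end do

  if (err > 1.0e-12_dp) error stop 'SoA product disagrees with matmul'
  print *, 'max abs deviation from matmul:', err
end program check_soa
```

The check only confirms that the SoA routine computes the same products as the AoS view; it says nothing about alignment or speed.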

  I assumed that the data would now be perfectly aligned and that vectorization
would therefore be straightforward. To my surprise, the Intel Fortran v16
optimization report says (ifort -opt-report5 -O3 -xavx  -align array128byte -opt-assume-safe-padding):

LOOP BEGIN at src/group_su3.f90(56,5)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/group_su3.f90(56,5)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at src/group_su3.f90(57,8)
         remark #15389: vectorization support: reference g has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15381: vectorization support: unaligned access used inside loop body
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15309: vectorization support: normalized vectorization overhead 0.239
         remark #15300: LOOP WAS VECTORIZED
         remark #15450: unmasked unaligned unit stride loads: 6
         remark #15451: unmasked unaligned unit stride stores: 1
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 33
         remark #15477: vector loop cost: 22.000
         remark #15478: estimated potential speedup: 1.490
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=128
      LOOP END
   LOOP END
LOOP END

 

  This means that, for some reason beyond my understanding, the compiler claims that the references to g, a and b are not correctly aligned in memory. It therefore vectorizes the loop, but does not expect much improvement (estimated speedup 1.49).

  Since I am asking for AVX instructions (SIMD width of 256 bits), which
can process two double-precision complex numbers at a time, I would expect a
naive speedup of 2x.
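
The factor of two is just the register width divided by the element size. A tiny sketch of that arithmetic, using the Fortran 2008 storage_size intrinsic (the names simd_lanes and avx_bits are made up for this example, and real64 stands in for kind=8):

```fortran
program simd_lanes
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: avx_bits = 256   ! width of one AVX register in bits
  complex(dp) :: z
  integer :: lanes

  z = (0.0_dp, 0.0_dp)
  ! storage_size returns the size of one element in bits:
  ! two 64-bit reals for a double-precision complex number.
  lanes = avx_bits / storage_size(z)

  print '(a,i0)', 'complex(real64) elements per AVX register: ', lanes
  if (lanes /= 2) error stop 'unexpected element size'
end program simd_lanes
```

This matches the "vector length 2" line in the report.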

Now, if I add the -ipo option, I get (ifort -ipo -opt-report5 -O3 -xavx  -align array128byte -opt-assume-safe-padding):

LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at src/group_su3.f90(57,8) inlined into src/main.f90(10,8)
         remark #15388: vectorization support: reference a.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15300: LOOP WAS VECTORIZED
         remark #15448: unmasked aligned unit stride loads: 6
         remark #15449: unmasked aligned unit stride stores: 1
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 33
         remark #15477: vector loop cost: 14.000
         remark #15478: estimated potential speedup: 2.350
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=128
      LOOP END
   LOOP END
LOOP END

This time the compiler correctly identifies the components v(:) as aligned and vectorizes the loop with aligned loads and stores (estimated speedup 2.35).
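
One possible reading (an assumption on my side, the report does not say so) is that the routine compiled on its own cannot know how the caller allocated the v(:) components, whereas after inlining the allocation is visible. A sketch of one way to test this is to move the inner operation into a helper that takes plain contiguous arrays, where an alignment assertion could be attached. The helper row_times_col and the module name are hypothetical; the Intel directive is mentioned only inside a comment, because asserting alignment that does not hold can crash the program.

```fortran
module row_col_kernel
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
contains

  ! Hypothetical helper (not in the original code): one entry of the product,
  ! g(k) = a1(k)*b1(k) + a2(k)*b2(k) + a3(k)*b3(k), for all n matrices.
  subroutine row_times_col(n, g, a1, b1, a2, b2, a3, b3)
    integer, intent(in) :: n
    complex(dp), intent(out) :: g(n)
    complex(dp), intent(in)  :: a1(n), b1(n), a2(n), b2(n), a3(n), b3(n)
    integer :: k

    ! With Intel Fortran, an alignment assertion could be placed here, e.g.
    !   !dir$ vector aligned
    ! It is left as commentary because it is only safe if every array
    ! really is aligned to the vector width (32 bytes for AVX).
    do k = 1, n
       g(k) = a1(k)*b1(k) + a2(k)*b2(k) + a3(k)*b3(k)
    end do
  end subroutine row_times_col

end module row_col_kernel

program demo_kernel
  use row_col_kernel
  implicit none
  integer, parameter :: n = 8
  complex(dp) :: g(n), a(n,3), b(n,3), ref
  integer :: k, m

  ! Small integer-valued test data, so the products are exact.
  do m = 1, 3
     do k = 1, n
        a(k,m) = cmplx(k, m, kind=dp)
        b(k,m) = cmplx(m, -k, kind=dp)
     end do
  end do

  call row_times_col(n, g, a(:,1), b(:,1), a(:,2), b(:,2), a(:,3), b(:,3))

  ! Compare with a direct sum over the three terms.
  do k = 1, n
     ref = sum(a(k,:)*b(k,:))
     if (abs(g(k) - ref) > 1.0e-12_dp) error stop 'kernel result mismatch'
  end do
  print *, 'kernel matches the direct sum'
end program demo_kernel
```

In the SoA code the call would pass the component arrays directly, for example g%comp(i,j)%v, a%comp(i,1)%v, b%comp(1,j)%v and so on, with n = size(g%comp(i,j)%v).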

Now the questions:

  1) What am I missing? Why does the original loop, without inlining the function, not vectorize with aligned accesses?

  2) A more general question: does all this shuffling of the data layout make
sense to you? Can I expect an improvement from using an SoA layout if my
code spends most of its time in small matrix multiplications? Any other hints
for writing efficient code for modern architectures while remaining standard
conforming?

  Many thanks!

A.

