Hi all,
My original code had some problems vectorizing well. Following suggestions from different places (including several Intel articles), I decided that restructuring the data layout was the most promising approach.
Instead of defining an array of complex matrices:
    type su3
       complex (kind=8) :: comp(3,3)
    end type

    type field
       type (su3), allocatable :: v(:)
    end type
which makes vectorization complicated (how do you vectorize the multiplication of a single 3x3 matrix?), I have decided to invert the order of the layout:
    type grid_complex
       complex (kind=8), allocatable :: v(:)
    end type

    type field
       type (grid_complex) :: comp(3,3)
    end type
(i.e. array of structures (AoS) versus structure of arrays (SoA)). This approach works for me because I *always* have to multiply O(1000) matrices at the same time. The idea was that this typical operation can now be coded in a routine like the following:
    module procedure su3xsu3(g,a,b)
      type (field), intent (in) :: a,b
      type (field), intent (inout) :: g
      integer :: i, j

      do concurrent (i=1:3,j=1:3)
         g%comp(i,j)%v = &
              a%comp(i,1)%v*b%comp(1,j)%v + &
              a%comp(i,2)%v*b%comp(2,j)%v + &
              a%comp(i,3)%v*b%comp(3,j)%v
      end do

      return
    end procedure su3xsu3
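For anyone who wants to reproduce this, here is a minimal self-contained sketch of the SoA layout and the routine above, with a check against the intrinsic matmul at one site. The names su3_soa, field_alloc, check_su3xsu3, nsites and s are made up for this example and are not from my real code. I have also written the routine as a plain module subroutine and used kind(1.0d0) instead of kind=8, just to keep the sketch portable.

```fortran
module su3_soa
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumption: stands in for kind=8

  type grid_complex
     complex (kind=dp), allocatable :: v(:)
  end type grid_complex

  type field
     type (grid_complex) :: comp(3,3)
  end type field

contains

  ! Hypothetical helper: allocate all nine component arrays with n sites.
  subroutine field_alloc(f, n)
    type (field), intent (inout) :: f
    integer, intent (in) :: n
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          allocate (f%comp(i,j)%v(n))
       end do
    end do
  end subroutine field_alloc

  ! Site-wise 3x3 product g = a*b; each assignment runs over all sites.
  ! g must already be allocated with the same number of sites as a and b.
  subroutine su3xsu3(g, a, b)
    type (field), intent (in) :: a, b
    type (field), intent (inout) :: g
    integer :: i, j
    do concurrent (i=1:3, j=1:3)
       g%comp(i,j)%v = &
            a%comp(i,1)%v*b%comp(1,j)%v + &
            a%comp(i,2)%v*b%comp(2,j)%v + &
            a%comp(i,3)%v*b%comp(3,j)%v
    end do
  end subroutine su3xsu3

end module su3_soa

program check_su3xsu3
  use su3_soa
  implicit none
  integer, parameter :: nsites = 1000   ! number of matrices multiplied at once
  integer, parameter :: s = 17          ! arbitrary site used for the check
  type (field) :: a, b, g
  complex (kind=dp) :: ma(3,3), mb(3,3), mg(3,3)
  real (kind=dp) :: re(nsites), im(nsites)
  integer :: i, j

  call field_alloc(a, nsites)
  call field_alloc(b, nsites)
  call field_alloc(g, nsites)

  ! Fill a and b with random complex entries.
  do j = 1, 3
     do i = 1, 3
        call random_number(re)
        call random_number(im)
        a%comp(i,j)%v = cmplx(re, im, kind=dp)
        call random_number(re)
        call random_number(im)
        b%comp(i,j)%v = cmplx(re, im, kind=dp)
     end do
  end do

  call su3xsu3(g, a, b)

  ! Gather the three 3x3 matrices living at site s and compare with matmul.
  do j = 1, 3
     do i = 1, 3
        ma(i,j) = a%comp(i,j)%v(s)
        mb(i,j) = b%comp(i,j)%v(s)
        mg(i,j) = g%comp(i,j)%v(s)
     end do
  end do

  if (maxval(abs(mg - matmul(ma, mb))) > 1.0e-12_dp) error stop "mismatch"
  print *, "SoA product matches matmul at site", s
end program check_su3xsu3
```

This only checks correctness of the layout and the routine; it says nothing about alignment or speed, which is what the rest of this post is about.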
I assumed that the data layout is now perfectly aligned, and that vectorization would therefore be straightforward. To my surprise, the Intel Fortran v16 optimization report says (ifort -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):
    LOOP BEGIN at src/group_su3.f90(56,5)
       remark #25101: Loop Interchange not done due to: Original Order seems proper
       remark #25452: Original Order found to be proper, but by a close margin
       remark #15542: loop was not vectorized: inner loop was already vectorized

       LOOP BEGIN at src/group_su3.f90(56,5)
          remark #15542: loop was not vectorized: inner loop was already vectorized

          LOOP BEGIN at src/group_su3.f90(57,8)
             remark #15389: vectorization support: reference g has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15389: vectorization support: reference a has unaligned access
             remark #15389: vectorization support: reference b has unaligned access
             remark #15381: vectorization support: unaligned access used inside loop body
             remark #15305: vectorization support: vector length 2
             remark #15399: vectorization support: unroll factor set to 2
             remark #15309: vectorization support: normalized vectorization overhead 0.239
             remark #15300: LOOP WAS VECTORIZED
             remark #15450: unmasked unaligned unit stride loads: 6
             remark #15451: unmasked unaligned unit stride stores: 1
             remark #15475: --- begin vector loop cost summary ---
             remark #15476: scalar loop cost: 33
             remark #15477: vector loop cost: 22.000
             remark #15478: estimated potential speedup: 1.490
             remark #15488: --- end vector loop cost summary ---
             remark #25015: Estimate of max trip count of loop=128
          LOOP END
       LOOP END
    LOOP END
This means that, for some reason beyond my understanding, the compiler claims that the variables g, a and b are not properly aligned in memory, and therefore, although it vectorizes the loop, it does not expect much improvement.
Since I am asking for AVX instructions (SIMD width of 256 bits), which can process 2 double-precision complex numbers at a time, I would expect a naive speedup of 2x.
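To spell out where the factor of 2 comes from, here is a tiny sketch of the arithmetic. It assumes that complex(kind=8) corresponds to two 64-bit reals (written here with real64 from iso_fortran_env); the names simd_lanes, avx_bits and lanes are only for illustration.

```fortran
program simd_lanes
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: avx_bits = 256   ! width of one AVX register in bits
  complex (kind=real64) :: z             ! only its storage size is queried
  integer :: lanes

  ! storage_size returns the size in bits: 2 x 64 for a double-precision complex.
  lanes = avx_bits / storage_size(z)

  if (lanes /= 2) error stop "unexpected number of complex values per register"
  print *, "bits per complex number:          ", storage_size(z)
  print *, "complex numbers per AVX register: ", lanes
end program simd_lanes
```

This matches the "vector length 2" line in the report, and it is why I take 2x as the naive upper bound rather than the 1.49 estimated above.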
Now, if I add the -ipo option, I get (ifort -ipo -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):
    LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
       remark #25101: Loop Interchange not done due to: Original Order seems proper
       remark #25452: Original Order found to be proper, but by a close margin
       remark #15542: loop was not vectorized: inner loop was already vectorized

       LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
          remark #15542: loop was not vectorized: inner loop was already vectorized

          LOOP BEGIN at src/group_su3.f90(57,8) inlined into src/main.f90(10,8)
             remark #15388: vectorization support: reference a.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15388: vectorization support: reference c.COMP.V has aligned access
             remark #15388: vectorization support: reference b.COMP.V has aligned access
             remark #15305: vectorization support: vector length 2
             remark #15399: vectorization support: unroll factor set to 2
             remark #15300: LOOP WAS VECTORIZED
             remark #15448: unmasked aligned unit stride loads: 6
             remark #15449: unmasked aligned unit stride stores: 1
             remark #15475: --- begin vector loop cost summary ---
             remark #15476: scalar loop cost: 33
             remark #15477: vector loop cost: 14.000
             remark #15478: estimated potential speedup: 2.350
             remark #15488: --- end vector loop cost summary ---
             remark #25015: Estimate of max trip count of loop=128
          LOOP END
       LOOP END
    LOOP END
This time the compiler correctly identifies the components v(:) as aligned and vectorizes the loop with aligned loads and stores.
Now the questions:
1) What am I missing? Why does the original loop, without inlining the function, not vectorize with aligned accesses?
2) A more general question: does all this shuffling of the data layout make sense to you? Can I expect an improvement from a SoA layout if my code spends most of its time in small matrix multiplications? Any other hints for writing efficient code for modern architectures while remaining standard conforming?
Many thanks!
A.