Problem vectorizing FORTRAN code

Hi all,

My original code had some problems getting good vectorization. Following suggestions from different places (including several Intel articles), I decided that restructuring the data layout seemed the most promising approach.

Instead of defining an array of complex matrices:

  type su3
     complex (kind=8) :: comp(3,3)
  end type

  type field
    type (su3), allocatable :: v(:)
  end type

that makes vectorization complicated (how to vectorize the multiplication of a single 3x3 matrix?), I have decided to invert the order of the layout:

  type grid_complex
     complex (kind=8), allocatable :: v(:)
  end type

  type field
    type (grid_complex) :: comp(3,3)
  end type

(i.e. AoS versus SoA data structures). This approach works for me because I *always* have to multiply O(1000) matrices at the same time. The idea was that this typical operation can now be coded in a routine like the following:

  subroutine su3xsu3(g,a,b)
    type (field), intent (in) :: a,b
    type (field), intent (inout) :: g
    integer :: i, j

    ! Each assignment is a whole-array operation over all elements of v
    do concurrent (i=1:3,j=1:3)
       g%comp(i,j)%v = &
            a%comp(i,1)%v*b%comp(1,j)%v + &
            a%comp(i,2)%v*b%comp(2,j)%v + &
            a%comp(i,3)%v*b%comp(3,j)%v
    end do

    return
  end subroutine su3xsu3
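
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the SoA layout and the product routine. The module name `su3_soa_sketch` and the helpers `field_alloc` and `site_matrix` are hypothetical names introduced only for this illustration; they are not part of my real code. It assumes every component array has already been allocated with the same length.

```fortran
module su3_soa_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! One contiguous array per matrix entry (SoA layout)
  type grid_complex
     complex (dp), allocatable :: v(:)
  end type grid_complex

  type field
     type (grid_complex) :: comp(3,3)
  end type field

contains

  ! Hypothetical helper: allocate all nine component arrays with n entries
  subroutine field_alloc(f, n)
    type (field), intent (out) :: f
    integer, intent (in) :: n
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          allocate(f%comp(i,j)%v(n))
          f%comp(i,j)%v = (0.0_dp, 0.0_dp)
       end do
    end do
  end subroutine field_alloc

  ! g = a*b for all n matrices at once; g must already be allocated
  subroutine su3xsu3(g, a, b)
    type (field), intent (in) :: a, b
    type (field), intent (inout) :: g
    integer :: i, j
    do concurrent (i=1:3, j=1:3)
       g%comp(i,j)%v = &
            a%comp(i,1)%v*b%comp(1,j)%v + &
            a%comp(i,2)%v*b%comp(2,j)%v + &
            a%comp(i,3)%v*b%comp(3,j)%v
    end do
  end subroutine su3xsu3

  ! Hypothetical helper: gather matrix number s into a plain 3x3 array
  ! (only meant for checking results, not for performance)
  function site_matrix(f, s) result(m)
    type (field), intent (in) :: f
    integer, intent (in) :: s
    complex (dp) :: m(3,3)
    integer :: i, j
    do j = 1, 3
       do i = 1, 3
          m(i,j) = f%comp(i,j)%v(s)
       end do
    end do
  end function site_matrix

end module su3_soa_sketch
```

A simple sanity check is to fill `a` and `b` with random numbers, call `su3xsu3(g,a,b)`, and compare `site_matrix(g,s)` with `matmul(site_matrix(a,s), site_matrix(b,s))` for each `s`.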

I was assuming that the data layout is now perfectly aligned, and therefore that vectorization would be straightforward. To my surprise, the Intel Fortran v16 optimization report says (ifort -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

LOOP BEGIN at src/group_su3.f90(56,5)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/group_su3.f90(56,5)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at src/group_su3.f90(57,8)
         remark #15389: vectorization support: reference g has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15389: vectorization support: reference a has unaligned access
         remark #15389: vectorization support: reference b has unaligned access
         remark #15381: vectorization support: unaligned access used inside loop body
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15309: vectorization support: normalized vectorization overhead 0.239
         remark #15300: LOOP WAS VECTORIZED
         remark #15450: unmasked unaligned unit stride loads: 6
         remark #15451: unmasked unaligned unit stride stores: 1
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 33
         remark #15477: vector loop cost: 22.000
         remark #15478: estimated potential speedup: 1.490
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=128
      LOOP END
   LOOP END
LOOP END

This means that, for some reason beyond my understanding, the compiler claims that variables a and b have incorrect alignment in memory, and therefore, although it vectorizes the loop, it does not expect much improvement.
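
As far as I understand, the report only reflects what the compiler can prove at compile time, which is not necessarily the actual alignment at run time. A rough way to check the real addresses is sketched below. The module `alignment_sketch` and the function `misalignment` are hypothetical names made up for this post, and the sketch assumes the array has at least one element.

```fortran
module alignment_sketch
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  integer, parameter :: dp = kind(1.0d0)

contains

  ! Byte offset of v(1) from the previous multiple of `boundary`.
  ! A result of 0 means v starts on a `boundary`-byte boundary
  ! (boundary = 32 corresponds to a 256-bit AVX register).
  function misalignment(v, boundary) result(off)
    complex (dp), intent (in), target :: v(:)
    integer, intent (in) :: boundary
    integer :: off
    integer (c_intptr_t) :: addr

    ! Reinterpret the C address of the first element as an integer
    addr = transfer(c_loc(v(1)), addr)
    off = int(modulo(addr, int(boundary, c_intptr_t)))
  end function misalignment

end module alignment_sketch
```

For example, `misalignment(a%comp(1,1)%v, 32)` returning 0 would mean that this component starts on a 32-byte boundary, whatever the report says about the dummy arguments.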

Since I am asking for AVX instructions (SIMD width of 256 bits), which can process 2 double-precision complex numbers at a time, I would expect a naive speedup of 2x.
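
The arithmetic behind that estimate can be written out as a small sketch. The names `simd_width_sketch` and `lanes` are hypothetical, and the result only says how many values fit in one register, not what speedup is actually achieved.

```fortran
module simd_width_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

contains

  ! Number of double-precision complex values that fit in a SIMD
  ! register of the given width in bits
  pure function lanes(register_bits) result(n)
    integer, intent (in) :: register_bits
    integer :: n
    complex (dp) :: z   ! only used to query its storage size

    n = register_bits / storage_size(z)
  end function lanes

end module simd_width_sketch
```

With a 128-bit double-precision complex, `lanes(256)` gives 2, which matches the "vector length 2" line in the report.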

Now if I use the -ipo option, I get (ifort -ipo -opt-report5 -O3 -xavx -align array128byte -opt-assume-safe-padding):

LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
   remark #25101: Loop Interchange not done due to: Original Order seems proper
   remark #25452: Original Order found to be proper, but by a close margin
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/group_su3.f90(56,5) inlined into src/main.f90(10,8)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at src/group_su3.f90(57,8) inlined into src/main.f90(10,8)
         remark #15388: vectorization support: reference a.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15388: vectorization support: reference c.COMP.V has aligned access
         remark #15388: vectorization support: reference b.COMP.V has aligned access
         remark #15305: vectorization support: vector length 2
         remark #15399: vectorization support: unroll factor set to 2
         remark #15300: LOOP WAS VECTORIZED
         remark #15448: unmasked aligned unit stride loads: 6
         remark #15449: unmasked aligned unit stride stores: 1
         remark #15475: --- begin vector loop cost summary ---
         remark #15476: scalar loop cost: 33
         remark #15477: vector loop cost: 14.000
         remark #15478: estimated potential speedup: 2.350
         remark #15488: --- end vector loop cost summary ---
         remark #25015: Estimate of max trip count of loop=128
      LOOP END
   LOOP END
LOOP END

Now the compiler correctly identifies the components v(:) as aligned and vectorizes the loop with aligned loads and stores.

Now the questions:

1) What am I missing? Why does the original loop, without inlining the function, not vectorize correctly?

2) A more general question: Does all this shuffling of the data layout make sense to you? Could I expect an improvement from using an SoA layout if my code spends most of its time in small matrix multiplications? Any other hints for writing efficient code for modern architectures while remaining standard-conforming?

Many thanks!
