Why is element-wise matrix row-exchanging more efficient than array-wise row-exchanging in Fortran?

Question

I have some Fortran code that performs a matrix row exchange. In the legacy code, it is written as

do J = 1,N
   save_val  = B(IROW,J)
   B(IROW,J) = B(JCOL,J)
   B(JCOL,J) = save_val
end do

This exchanges row IROW with row JCOL (IROW and JCOL are integers). However, the function of that block of code is not intuitive. In my opinion, it would be more intuitive, or at least aid readability to write it as:

save_row  = B(IROW,:)
B(IROW,:) = B(JCOL,:)
B(JCOL,:) = save_row

(it is more clear that rows are being moved).

From the image attached, it is clear that the looping methods provide better performance relative to the array operations. Why is this? Is it because this becomes a memory limited process when the number of elements in the array gets large? (i.e. the array would get "chunked") Or is it something else?

Compiling as gfortran -O3 test.f95. Adding the flag fstack-arrays did not make significant difference.

program test

  implicit none

  integer  :: N
  integer  :: M
  integer  :: loop_max = 1e7
  integer  :: i                    ! loop index
  real     :: t1, t2
  real     :: time_loop, time_array, time_sub_loop, time_sub_array

  real, dimension(:, :), allocatable   :: B
  real, dimension(:)   , allocatable   :: save_row

  real :: save_val
  integer :: IROW, J, JCOL

  character(*), parameter :: format_header = '(A5, 1X, 4(A12,1X))'
  character(*), parameter :: format_data = '(I5, 1X, 4(ES12.5, 1X))'


  open(1, file = 'TimingRowExchange.txt', status = 'unknown')
  write(1, format_header) 'N', 't_loop', 't_array', 't_sub_loop', 't_sub_array'

  do N = 1, 100
    M = N + 1
    allocate(B(N,N), save_row(M))
    call random_number(B)

    JCOL = 1
    IROW = 3

    call CPU_time(t1)
    do i = 1, loop_max
      do J = 1,N
        save_val  = B(IROW,J)
        B(IROW,J) = B(JCOL,J)
        B(JCOL,J) = save_val
      end do
    end do
    call CPU_time(t2)
    time_loop = t2 - t1
    ! write ( *, * ) 'Using Loop =', t2 - t1


    call CPU_time(t1)
    do i = 1, loop_max
        save_row(1:N) = B(IROW,:)
        B(IROW,:)     = B(JCOL,:)
        B(JCOL,:)     = save_row(1:N)
    end do
    call CPU_time(t2)
    time_array = t2 - t1
    ! write ( *, * ) 'Using Array =', t2 - t1

    call CPU_time(t1)
    do i = 1, loop_max
      call exchange_rows_loop(B, JCOL, IROW)
    end do
    call CPU_time(t2)
    time_sub_loop = t2 - t1
    ! write ( *, * ) 'Loop Subrout =', t2 - t1


    call CPU_time(t1)
    do i = 1, loop_max
      call exchange_rows_array(B, JCOL, IROW)
    end do
    call CPU_time(t2)
    time_sub_array = t2 - t1
    ! write ( *, * ) 'Array Subrout =', t2 - t1

    deallocate(B, save_row)
    write(1, format_data) N, time_loop, time_array, time_sub_loop, time_sub_array
  end do


contains


  subroutine print_mat(A)
    implicit none
    real, dimension(:,:), intent(in) :: A
    integer :: n

    n = size(A,1) ! # of rows

    do i = 1,n
      print*, A(i,:)
    end do
    print*,

  end subroutine print_mat



  subroutine exchange_rows_loop(A, row1, row2)
    implicit none
    real, dimension(:,:), intent(in out) :: A
    integer,              intent(in)     :: row1, row2

    integer :: J
    real :: save_val

    do J = 1, size(A,1)
      save_val  = A(row1,J)
      A(row1,J) = A(row2,J)
      A(row2,J) = save_val
    end do

  end subroutine exchange_rows_loop



  subroutine exchange_rows_array(A, row1, row2)
    implicit none
    real, dimension(:,:), intent(in out) :: A
    integer,              intent(in)     :: row1, row2

    real, dimension(size(A,1))           :: save_row

    save_row  = A(row1,:)
    A(row1,:) = A(row2,:)
    A(row2,:) = save_row

  end subroutine exchange_rows_array


end program test

This is quite typical. Manual loops tend to be faster. However, there will be something more here, the difference is big. You are forcing the writing and reading of larger pieces of memory. I expect some cache issues. — Vladimir F Героям слава, Sep 29 '20 at 12:14
Actually, vectorization issues are much more likely, cache issues mere like changes the behaviour with different N. — Vladimir F Героям слава, Sep 29 '20 at 12:44
I also object to the waste of memory in the array form when a scalar will do ... Also note that you have out of bounds array accesses when N = 1 and 2 as IROW is set to 3. It probably won't make a difference, but I would re-run starting with N=3 as otherwise the code is incorrect and in theory can do anything. — Ian Bush, Sep 29 '20 at 12:48
As regards caching I would have thought that repeatedly looping over these small matrices and exactly repeating the same memory access pattern, as the elements of interest will get loaded into cache on the first iteration, and then, as the matrices are small, remain there for the rest of the run. But I'm not sure how stores work on modern x86 (assuming that is the architecture), almost all I do is load bound. — Ian Bush, Sep 29 '20 at 12:51
I'm wondering if the same problems arise if you do column swaps instead of row swaps. Since Fortran stores arrays in column-major order, I can imagine that a column-swap is more [cache friendly](https://stackoverflow.com/q/16699247/8344060) than a row-swap from a given size onwards. — kvantour, Sep 29 '20 at 13:00
@kvantour The compiler does not even know the stride in advance so the vectorization possibilities are really limited. — Vladimir F Героям слава, Sep 29 '20 at 14:29

score 2 · Accepted Answer · answered Sep 29 '20 at 14:14

My understanding of Fortran's philosophy (strength) is that the language should help the user focus on the science, while taking care of most computer-related stuff such as optimization-for-speed, garbage collection, etc.

Functional programming style via pure/elemental functions and subroutines is IMHO one of the greatest tools that has been introduced yet underutilized, because it makes code clear, simpler and more robust.

So I've added one more test with an elemental swapping routine:

  subroutine exchange_rows_elemental(A, row1, row2)
    implicit none
    real, dimension(:,:), intent(in out) :: A
    integer,              intent(in)     :: row1, row2
    call swap(A(row1,:),A(row2,:))
  end subroutine exchange_rows_elemental  

  elemental subroutine swap(a,b)
     real, intent(inout) :: a,b
     real :: save_val
     save_val = a
     a = b
     b = save_val
  end subroutine swap

And in the main:

call CPU_time(t1)
do i = 1, loop_max
  call exchange_rows_elemental(B, JCOL, IROW)
end do
call CPU_time(t2)
time_elemental = t2 - t1
! write ( *, * ) 'Elemental =', t2 - t1

This is what I get with gfortran 9.2.0 on Windows:

The elemental version is almost as fast as the fastest loop versions, but it likely operates in a vectorized way. I'm sure in this case the compiler is probably inlining the swap routine (maybe it wouldn't be able to if it was in another file), but still, telling the compiler that the swap routine can be vectorized probably helps it achieve optimal performance. I like it as a good way to make best usage of compiler optimization without cluttering the source code with nested loops and loop variables.

That is really useful and idneed appears virtually as fast as the loop, but it is not really functional programming. Also, I do not think that elemental helps vectorization per-se. — Vladimir F Героям слава, Sep 29 '20 at 14:18
It is also good to note that one can just `call swap(B(IROW,:),B(JCOL,:))` to get the same effect. — Vladimir F Героям слава, Sep 29 '20 at 14:23

Vladimir F Героям слава · Answer 2 · 2020-09-29T15:41:05.747

Those two versions do not do the same, the ordering is different and the optimizations done by the compiler are different. The assembly is available at https://godbolt.org/z/hc8zcs

The loop:

    movss   xmm0, DWORD PTR [rax+12]
    movss   xmm1, DWORD PTR [rax+4]
    movss   DWORD PTR [rax+4], xmm0
    movss   DWORD PTR [rax+12], xmm1
    cmp     ebx, 1
    je      .L6
    movss   xmm0, DWORD PTR [rsi]
    movss   xmm1, DWORD PTR [rcx]
    movss   DWORD PTR [rsi], xmm1
    movss   DWORD PTR [rcx], xmm0
    cmp     ebx, 2
    je      .L6
    movss   xmm0, DWORD PTR [r8]
    movss   xmm1, DWORD PTR [rdi]
    movss   DWORD PTR [r8], xmm1
    movss   DWORD PTR [rdi], xmm0

It is somewhat unrolled, the vectorized instructions (SSE) are used. Overall it is very straightforward.

The subarray version:

        mov     r10, QWORD PTR [rsp+216]
        mov     rsi, QWORD PTR [rsp+208]
        mov     r8d, 10000000
        mov     rdx, QWORD PTR [rsp+152]
        mov     rdi, QWORD PTR [rsp+144]
        mov     r11, r10
        lea     rbx, [r10+1]
        lea     r15, [r10+2]
        mov     r9, QWORD PTR [rsp+8]
        imul    r11, rsi
        lea     rax, [rdx+1]
        mov     rcx, QWORD PTR [rsp+224]
        imul    rbx, rsi
        lea     r12, [rdx+3]
        imul    r15, rsi
        sal     rsi, 2
        lea     rbp, [r12+r11]
        add     rdx, r11
        add     r11, rax
        lea     r14, [r12+rbx]
        lea     r13, [rdi+rdx*4]
        add     rbx, rax
        add     r12, r15
        add     r15, rax
        mov     QWORD PTR [rsp], r15
        mov     r15, QWORD PTR [rsp+80]
.L13:
        movss   xmm0, DWORD PTR [rdi+rbp*4]
        movss   DWORD PTR [r9], xmm0
        cmp     r15, 1
        je      .L10
        movss   xmm0, DWORD PTR [rdi+r14*4]
        movss   DWORD PTR [r9+4], xmm0
        cmp     r15, 2
        je      .L11
        movss   xmm0, DWORD PTR [rdi+r12*4]
        movss   DWORD PTR [r9+8], xmm0
.L11:
        cmp     r10, rcx
        jg      .L65
.L37:
        mov     rax, r13
        mov     rdx, r10
.L14:
        movss   xmm0, DWORD PTR [rax+4]
        add     rdx, 1
        movss   DWORD PTR [rax+12], xmm0
        add     rax, rsi
        cmp     rcx, rdx
        jge     .L14
        movss   xmm0, DWORD PTR [r9]
        movss   DWORD PTR [rdi+r11*4], xmm0
        cmp     r15, 1
        je      .L15
.L35:
        movss   xmm0, DWORD PTR [r9+4]
        movss   DWORD PTR [rdi+rbx*4], xmm0
        cmp     r15, 3
        jne     .L15
        movss   xmm0, DWORD PTR [r9+8]
        mov     rax, QWORD PTR [rsp]
        movss   DWORD PTR [rdi+rax*4], xmm0

There is a lot of pointer arithmetic, comparison and branching involved. I suspect that the compiler has to check for the aliasing of the subarrays or for zero array extents and it is harder perform efficient vectorization.

Why is element-wise matrix row-exchanging more efficient than array-wise row-exchanging in Fortran?

2 Answers2