I'm writing a C program to compare the effect of spatial locality when accessing a matrix in row-major order versus column-major order.
However, the row-major traversal turns out to be slower, which is the opposite of what I've learned.
I'm running Ubuntu 16.04 (in VMware) with 2 GB of memory and 2 cores, and I compile the code with gcc 5.4.0.
I create a 2000×2000 integer matrix and use rdtsc to measure the total clock cycles spent on the traversal.
I also tried measuring the time with clock_gettime(), and it showed a similar result.
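For reference, the clock_gettime() variant is roughly the following (a minimal sketch; the choice of CLOCK_MONOTONIC and the nanosecond conversion are mine and not essential to the measurement):

#include <stdio.h>
#include <time.h>

/* Time an arbitrary code region in nanoseconds with the monotonic clock. */
int main(void)
{
    struct timespec t0, t1;
    long long ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... traversal under test goes here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    printf("elapsed: %lld ns\n", ns);
    return 0;
}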
Here is the code I'm using to test these cases:
#include <stdio.h>

#define ROW    2000
#define COLUMN 2000

typedef enum {
    ROW_MAJOR = 1,
    COLUMN_MAJOR
} traverse_type;

/* Read the x86 time-stamp counter; rdtsc returns the 64-bit count in EDX:EAX. */
static inline unsigned long long read_tsc(void)
{
    unsigned low, high;
    asm volatile("rdtsc" : "=a"(low), "=d"(high));
    return ((unsigned long long)high << 32) | low;
}
void calc_matrix_traverse_clocks(traverse_type type) {
    /* NOTE: ROW*COLUMN ints is ~16 MB, more than Linux's default 8 MB stack;
       raise the limit (ulimit -s) or make the array static if this crashes. */
    int mat[ROW][COLUMN];
    int i, j, sum = 0;   /* sum only exists to force the reads */
    unsigned long long before, after;
    if (type == ROW_MAJOR) {
        before = read_tsc();
        for (i = 0; i < ROW; i++) {
            for (j = 0; j < COLUMN; j++) {
               sum += mat[i][j];
            }
        }
        after = read_tsc();
        printf("ROW Major: %llu\n", after - before);
    } else {
        before = read_tsc();
        for (i = 0; i < COLUMN; i++) {
            for (j = 0; j < ROW; j++) {
                sum += mat[j][i];
            }
        }
        after = read_tsc();
        printf("COLUMN Major: %llu\n", after - before);
    }
    (void)sum;  /* silence unused-variable warnings; an optimizing build could
                   still discard the loops, so compile without -O flags */
}
int main (int argc, char *argv[]) {
    calc_matrix_traverse_clocks(ROW_MAJOR);
    //calc_matrix_traverse_clocks(COLUMN_MAJOR);    
    return 0;
}
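For completeness, I build it without optimization flags, i.e. something like the lines below (matrix.c is just a placeholder name). This matters: with -O2, the otherwise-unused sum could let gcc drop the loops entirely.

gcc matrix.c -o matrix
./matrix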
I've run the program several times. Even when I split the two cases into separate files and run them independently, the results are always similar:
ROW Major: 9274244
COLUMN Major: 5856220
To make sure the row-major version actually has spatial locality, I checked that the addresses of its successive memory accesses are contiguous, while those of the column-major version are not.
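(The check amounted to printing the addresses of the first few accesses, along these lines:)

#include <stdio.h>

#define ROW    2000
#define COLUMN 2000

int main(void)
{
    static int mat[ROW][COLUMN];  /* static: too big for the default stack */
    int i, j;

    /* Row-major: &mat[i][j] advances by sizeof(int) between accesses.
       Column-major: it jumps by COLUMN * sizeof(int) instead. */
    for (i = 0; i < 2; i++)
        for (j = 0; j < 4; j++)
            printf("mat[%d][%d] at %p\n", i, j, (void *)&mat[i][j]);
    return 0;
}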
Besides, I also checked the number of cache misses in the two versions, and the row-major method does show the lower cache-miss rate. But under this measurement, the cycle count for the row-major traversal also becomes the smaller one.
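The cache numbers below come from cachegrind, invoked like this (the binary names are placeholders):

valgrind --tool=cachegrind ./row_major
valgrind --tool=cachegrind ./column_major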
Here are the results:
Row-major version:
ROW Major: 124149507
==10786== 
==10786== I   refs:      11,160,797
==10786== I1  misses:           955
==10786== LLi misses:           944
==10786== I1  miss rate:       0.01%
==10786== LLi miss rate:       0.01%
==10786== 
==10786== D   refs:       6,054,374  (6,041,857 rd   + 12,517 wr)
==10786== D1  misses:        65,590  (   64,992 rd   +    598 wr)
==10786== LLd misses:        65,113  (   64,554 rd   +    559 wr)
==10786== D1  miss rate:        1.1% (      1.1%     +    4.8%  )
==10786== LLd miss rate:        1.1% (      1.1%     +    4.5%  )
==10786== 
==10786== LL refs:           66,545  (   65,947 rd   +    598 wr)
==10786== LL misses:         66,057  (   65,498 rd   +    559 wr)
==10786== LL miss rate:         0.4% (      0.4%     +    4.5%  )
Column-major version:
COLUMN Major: 131878026
==10793== 
==10793== I   refs:      11,160,943
==10793== I1  misses:           952
==10793== LLi misses:           943
==10793== I1  miss rate:       0.01%
==10793== LLi miss rate:       0.01%
==10793== 
==10793== D   refs:       6,054,434  (6,041,899 rd   + 12,535 wr)
==10793== D1  misses:     1,003,086  (1,002,488 rd   +    598 wr)
==10793== LLd misses:        65,104  (   64,545 rd   +    559 wr)
==10793== D1  miss rate:       16.6% (     16.6%     +    4.8%  )
==10793== LLd miss rate:        1.1% (      1.1%     +    4.5%  )
==10793== 
==10793== LL refs:        1,004,038  (1,003,440 rd   +    598 wr)
==10793== LL misses:         66,047  (   65,488 rd   +    559 wr)
==10793== LL miss rate:         0.4% (      0.4%     +    4.5%  )
As shown in the valgrind output, the row-major version does have lower miss rates. This confuses me, and now I'm wondering whether I've done something wrong in the code or in my measurement method.
