I'm writing a C program to compare the effect of spatial locality when accessing a matrix in row-major order versus column-major order.
However, the row-major traversal turns out to be slower, which is the opposite of what I've learned.
I'm running Ubuntu 16.04 (in VMware) with 2 GB of memory and 2 cores, and I compile the code with gcc 5.4.0.
I create a 2000×2000 integer matrix and use rdtsc to measure the total clock cycles spent on the traversal.
I also tried measuring the time with clock_gettime(), and it showed a similar result.
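For reference, the clock_gettime() variant is roughly the following (a minimal sketch; the choice of CLOCK_MONOTONIC and the nanosecond conversion are mine and not essential to the measurement):

#include <stdio.h>
#include <time.h>

/* Time an arbitrary code region in nanoseconds with the monotonic clock. */
int main(void)
{
    struct timespec t0, t1;
    long long ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... traversal under test goes here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    printf("elapsed: %lld ns\n", ns);
    return 0;
}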
Here is the code I'm using to test these cases:
#include <stdio.h>

#define ROW    2000
#define COLUMN 2000

typedef enum {
    ROW_MAJOR = 1,
    COLUMN_MAJOR
} traverse_type;

/* Read the x86 time-stamp counter; rdtsc returns the 64-bit count in EDX:EAX. */
static inline unsigned long long read_tsc(void)
{
    unsigned low, high;
    asm volatile("rdtsc" : "=a"(low), "=d"(high));
    return ((unsigned long long)high << 32) | low;
}
void calc_matrix_traverse_clocks(traverse_type type) {
    /* NOTE: ROW*COLUMN ints is ~16 MB, more than Linux's default 8 MB stack;
       raise the limit (ulimit -s) or make the array static if this crashes. */
    int mat[ROW][COLUMN];
    int i, j, sum = 0;   /* sum only exists to force the reads */
    unsigned long long before, after;
    if (type == ROW_MAJOR) {
        before = read_tsc();
        for (i = 0; i < ROW; i++) {
            for (j = 0; j < COLUMN; j++) {
               sum += mat[i][j];
            }
        }
        after = read_tsc();
        printf("ROW Major: %llu\n", after - before);
    } else {
        before = read_tsc();
        for (i = 0; i < COLUMN; i++) {
            for (j = 0; j < ROW; j++) {
                sum += mat[j][i];
            }
        }
        after = read_tsc();
        printf("COLUMN Major: %llu\n", after - before);
    }
    (void)sum;  /* silence unused-variable warnings; an optimizing build could
                   still discard the loops, so compile without -O flags */
}
int main (int argc, char *argv[]) {
    calc_matrix_traverse_clocks(ROW_MAJOR);
    //calc_matrix_traverse_clocks(COLUMN_MAJOR);    
    return 0;
}
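For completeness, I build it without optimization flags, i.e. something like the lines below (matrix.c is just a placeholder name). This matters: with -O2, the otherwise-unused sum could let gcc drop the loops entirely.

gcc matrix.c -o matrix
./matrix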
I've run the program several times. Even when I split the two cases into separate files and run them independently, the results are always similar:
ROW Major: 9274244
COLUMN Major: 5856220
To make sure the row-major version actually has spatial locality, I checked that the addresses of its successive memory accesses are contiguous, while those of the column-major version are not.
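(The check amounted to printing the addresses of the first few accesses, along these lines:)

#include <stdio.h>

#define ROW    2000
#define COLUMN 2000

int main(void)
{
    static int mat[ROW][COLUMN];  /* static: too big for the default stack */
    int i, j;

    /* Row-major: &mat[i][j] advances by sizeof(int) between accesses.
       Column-major: it jumps by COLUMN * sizeof(int) instead. */
    for (i = 0; i < 2; i++)
        for (j = 0; j < 4; j++)
            printf("mat[%d][%d] at %p\n", i, j, (void *)&mat[i][j]);
    return 0;
}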
Besides, I also checked the number of cache misses in the two versions, and the row-major method does show the lower cache-miss rate. But under this measurement, the cycle count for the row-major traversal also becomes the smaller one.
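The cache numbers below come from cachegrind, invoked like this (the binary names are placeholders):

valgrind --tool=cachegrind ./row_major
valgrind --tool=cachegrind ./column_major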
Here are the results:
Row-major version:
ROW Major: 124149507
==10786== 
==10786== I   refs:      11,160,797
==10786== I1  misses:           955
==10786== LLi misses:           944
==10786== I1  miss rate:       0.01%
==10786== LLi miss rate:       0.01%
==10786== 
==10786== D   refs:       6,054,374  (6,041,857 rd   + 12,517 wr)
==10786== D1  misses:        65,590  (   64,992 rd   +    598 wr)
==10786== LLd misses:        65,113  (   64,554 rd   +    559 wr)
==10786== D1  miss rate:        1.1% (      1.1%     +    4.8%  )
==10786== LLd miss rate:        1.1% (      1.1%     +    4.5%  )
==10786== 
==10786== LL refs:           66,545  (   65,947 rd   +    598 wr)
==10786== LL misses:         66,057  (   65,498 rd   +    559 wr)
==10786== LL miss rate:         0.4% (      0.4%     +    4.5%  )
Column-major version:
COLUMN Major: 131878026
==10793== 
==10793== I   refs:      11,160,943
==10793== I1  misses:           952
==10793== LLi misses:           943
==10793== I1  miss rate:       0.01%
==10793== LLi miss rate:       0.01%
==10793== 
==10793== D   refs:       6,054,434  (6,041,899 rd   + 12,535 wr)
==10793== D1  misses:     1,003,086  (1,002,488 rd   +    598 wr)
==10793== LLd misses:        65,104  (   64,545 rd   +    559 wr)
==10793== D1  miss rate:       16.6% (     16.6%     +    4.8%  )
==10793== LLd miss rate:        1.1% (      1.1%     +    4.5%  )
==10793== 
==10793== LL refs:        1,004,038  (1,003,440 rd   +    598 wr)
==10793== LL misses:         66,047  (   65,488 rd   +    559 wr)
==10793== LL miss rate:         0.4% (      0.4%     +    4.5%  )
As shown in the valgrind output, the row-major version does have lower miss rates. This confuses me, and now I'm wondering whether I've done something wrong in the code or in my measurement method.
