I'm learning about how caching works using the following example:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t data_t;

const int U = 10000000;   // size of the array: 10 million values ~= 40 MB
const int N = 100000000;  // number of searches to perform

int main() {
    data_t* data = (data_t*) malloc(U * sizeof(data_t));
    if (data == NULL) {
        printf("Error: not enough memory\n");
        exit(-1);
    }

    // fill up the array with sequential (sorted) values
    int i;
    for (i = 0; i < U; i++) {
        data[i] = i;
    }

    printf("Allocated array of size %d\n", U);
    printf("Summing %d random values...\n", N);

    data_t val = 0;
    unsigned int seed = 42;  // rand_r() expects an unsigned int*
    for (i = 0; i < N; i++) {
        int l = rand_r(&seed) % U;
        val = val + data[l];
    }

    free(data);
    printf("Done. Value = %u\n", val);
    return 0;
}
The relevant annotation of the slow random-access loop, obtained with perf record ./sum and then perf report, is:
 0.05 │       mov    -0x18(%rbp),%eax
 0.07 │       mov    -0x10(%rbp),%rcx
      │       movslq -0x20(%rbp),%rdx
 0.03 │       add    (%rcx,%rdx,4),%eax
95.39 │       mov    %eax,-0x18(%rbp)
 1.34 │       mov    -0x14(%rbp),%eax
      │       add    $0x1,%eax
      │       mov    %eax,-0x14(%rbp)
In this listing, -0x18(%rbp) holds val, -0x10(%rbp) holds data, -0x14(%rbp) holds i, and -0x20(%rbp) holds l. The numbers in the left column show the percentage of time attributed to each instruction. I expected the line add (%rcx,%rdx,4),%eax to take up the most time, since it performs the random-access load of data[l] (that address is exactly (%rcx,%rdx,4)). The loaded element should be in the L1 cache only about 16K/U ≈ 0.16% of the time, since my L1 cache is 64 KB, i.e. room for 16K of these 4-byte integers (back-of-the-envelope arithmetic below). So this load should be slow. In contrast, the instruction that is apparently slow merely moves %eax into val, a stack slot that is touched on every iteration and is therefore certainly in cache. Can anyone explain what's going on?
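For reference, here is the arithmetic behind that 0.16% figure as a tiny standalone sketch. The constant names are just illustrative; 64 KB is my L1 data cache size as stated above, and the calculation ignores the few extra stack bytes the loop touches:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const double l1_bytes    = 64 * 1024;        // L1 data cache size: 64 KB
    const double elem_bytes  = sizeof(uint32_t); // each array element is 4 bytes
    const double array_elems = 10000000;         // U: 10 million elements ~= 40 MB

    double elems_in_l1 = l1_bytes / elem_bytes;  // ~16K elements fit in L1
    double hit_chance  = elems_in_l1 / array_elems;

    printf("elements that fit in L1: %.0f\n", elems_in_l1);
    printf("chance a random index is L1-resident: %.4f%%\n", hit_chance * 100.0);
    // prints: 16384 elements, 0.1638%
    return 0;
}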