
I'm learning about how caching works using the following example:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t data_t;
const int U = 10000000;   // size of the array. 10 million vals ~= 40MB
const int N = 100000000;  // number of searches to perform

int main() {
  data_t* data = (data_t*) malloc(U * sizeof(data_t));
  if (data == NULL) {
    free(data);
    printf("Error: not enough memory\n");
    exit(-1);
  }

  // fill up the array with sequential (sorted) values.
  int i;
  for (i = 0; i < U; i++) {
    data[i] = i;
  }

  printf("Allocated array of size %d\n", U);
  printf("Summing %d random values...\n", N);

  data_t val = 0;
  data_t seed = 42;
  for (i = 0; i < N; i++) {
    int l = rand_r(&seed) % U;
    val = (val + data[l]);
  }

  free(data);
  printf("Done. Value = %d\n", val);
  return 0;
}

The relevant annotation of the slow random-access loop, obtained with `perf record ./sum` followed by `perf report`, is:

  0.05 │       mov    -0x18(%rbp),%eax
  0.07 │       mov    -0x10(%rbp),%rcx
       │       movslq -0x20(%rbp),%rdx
  0.03 │       add    (%rcx,%rdx,4),%eax
 95.39 │       mov    %eax,-0x18(%rbp)
  1.34 │       mov    -0x14(%rbp),%eax
       │       add    $0x1,%eax
       │       mov    %eax,-0x14(%rbp)

At this point, -0x18 holds val, -0x10 holds data, -0x14 holds i, and -0x20 holds l. The numbers in the left column show the percentage of samples attributed to each instruction. I expected the line add (%rcx,%rdx,4),%eax to take up the most time, since it has to perform a random-access load of data[l] (which is exactly the address (%rcx,%rdx,4)). That load should hit the L1 cache only about 16k/U = 0.16% of the time, since my L1 cache holds 64k bytes, i.e. 16k 4-byte integers, out of an array of 10 million. So this operation should be slow. In contrast, the instruction that is apparently slow just stores the register %eax back into val, a stack slot accessed so often that it is definitely in cache. Can anyone explain what's going on?
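
As a quick sanity check of that arithmetic, here is a minimal sketch (it assumes the question's 64 KiB L1d figure; per the comments below, the actual L1d on this CPU is 48 KiB):

#include <stdio.h>
#include <stdint.h>

int main(void) {
  double l1_bytes = 64.0 * 1024;               // assumed L1d size (the question's figure)
  double elems = l1_bytes / sizeof(uint32_t);  // 16384 four-byte elements fit in L1d
  double U = 10000000.0;                       // array length from the program above
  // a uniformly random index lands on a cached element about elems/U of the time
  printf("expected L1d hit rate: %.2f%%\n", 100.0 * elems / U);  // prints 0.16%
  return 0;
}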

  • HW performance counters usually "blame" the instruction waiting for the slow result (the store), not the instruction that was slow to produce it (the memory-source add). Possible duplicate of [Inconsistent `perf annotate` memory load/store time reporting](https://stackoverflow.com/q/65906312), which is a less simple/obvious case, but this fact is part of explaining that. – Peter Cordes Jul 26 '23 at 17:52
  • Thanks, that clears it up! Found something else confusing. I ran `valgrind --tool=cachegrind --branch-sim=yes ./sum` and it's showing the expected L1 data read & write misses (99.88 million and 625k respectively). But for D refs, I'm seeing 950M reads and 520M writes. What does this mean? Is it from free or malloc? – Steven Mai Jul 26 '23 at 18:13
  • Full cachegrind summary:
    ==4617== D refs:          1,470,053,083  (950,039,029 rd + 520,014,054 wr)
    ==4617== D1  misses:        100,506,068  ( 99,880,381 rd +     625,687 wr)
    ==4617== LLd misses:         37,858,821  ( 37,233,169 rd +     625,652 wr)
    ==4617== D1  miss rate:             6.8% (        10.5%  +         0.1%  )
    ==4617== LLd miss rate:             2.6% (         3.9%  +         0.1%  )
    – Steven Mai Jul 26 '23 at 18:13
  • What CPU model do you have? You say you have a 64KiB L1d cache? Most Intels have 32K or recently 48K. – Peter Cordes Jul 26 '23 at 18:14
  • Oh you're right it is 48k for L1d – Steven Mai Jul 26 '23 at 18:16
  • Ran with `lscpu`: `caches (sum of all): L1d: 96 KiB (2 instances) L1i: 64 KiB (2 instances) L2: 2.5 MiB (2 instances) L3: 48 MiB (2 instances)` – Steven Mai Jul 26 '23 at 18:17
  • I asked what CPU model number, like i7-6700k or something. I really wanted to be sure it was Intel sandybridge-family, rather than AMD. (But now I'm also wondering how you have 48 MiB of L3 cache in a 2-core CPU. Oh, is that a VM instance on an Ice Lake or Sapphire Rapids Xeon?) – Peter Cordes Jul 26 '23 at 18:23
  • Yeah I'm using a VM. My cpu is 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz – Steven Mai Jul 26 '23 at 18:25
  • Ok, so your VM and/or `lscpu` did a bad job and didn't realize that both cores are sharing the same 24 MiB L3 cache on your Tiger Lake CPU. L1d/i and L2 are per-core private (so it's silly to sum them), but L3 is shared across all cores in a socket on Intel CPUs. (Unlike AMD Zen where groups of 8 cores share an L3 cache.) – Peter Cordes Jul 26 '23 at 18:32

2 Answers


HW performance counters usually "blame" the instruction waiting for the slow result (the store), not the instruction that was slow to produce it (the memory-source add, which decodes to a micro-fused load+add uop).

And/or they blame the next instruction after a cache-miss load regardless of data dependencies. This is called a "skid" or "skew". See for example https://easyperf.net/blog/2018/08/29/Understanding-performance-events-skid and https://www.brendangregg.com/perf.html
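
One way to reduce skid is to request precise sampling. A sketch (assuming an Intel CPU with PEBS and a perf build that accepts the `:pp` precision modifier):

  perf record -e cycles:pp ./sum
  perf report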

My hypothesis for the cause of this effect is that Intel CPUs wait for the oldest instruction in the ROB to retire when an interrupt is raised, perhaps to avoid starving the main thread in a high-interrupt situation. A cache-miss load that ends up stalling out-of-order execution will be the oldest instruction in the ROB, unable to retire until the load data arrives (because x86's strongly-ordered memory model doesn't let loads retire before then, even when they're known to be non-faulting, unlike on ARM). So when the "cycles" event's counter ticks down to zero and triggers a sample, the cache-miss load retires and the next instruction in program order gets the "blame" for that sample.

For events intended to attach to a specific instruction, like mem_load_retired.l3_miss, a skid is more problematic, but Intel PEBS avoids it. In the previous paragraph I was talking about the "cycles" event, which ticks every cycle while stalled; you could potentially get the same blame-shifting for mem_load_retired.l3_miss, since the miss can't be detected until the load hears back from the L3 slice.
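
For example, something like this (a hypothetical invocation; the exact event name and its PEBS support vary by CPU generation):

  perf record -e mem_load_retired.l3_miss:pp ./sum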

In code that doesn't stall, the first or second instruction in a group that all retire in the same cycle may get the blame. The CPU has to pick one instruction to blame out of everything in flight in the pipeline, either by where it raises the interrupt (non-PEBS) or by which instruction address goes into the PEBS buffer.


See also [Inconsistent `perf annotate` memory load/store time reporting](https://stackoverflow.com/q/65906312), which is a less simple/obvious case, but blaming the instruction waiting for the slow result is a key part of explaining it.

– Peter Cordes

Here is what I think is going on ...

First thing to note is that the line int l = rand_r(&seed) % U; is invariant. That is, it always gets calculated to the same thing. As a result, the compiler has cached that value in either %rcx or %rdx.

With that in mind let's look at these lines ...

  0.03 │       add    (%rcx,%rdx,4),%eax                                                               
 95.39 │       mov    %eax,-0x18(%rbp)

The first of those will add data[l] and val together, storing the result in %eax. As data[l] never changes, it will always load the value from the same memory location. This value is likely getting cached by the hardware. The second of those is writing the sum to a local variable on the stack: val.

It is the writing of the value to the stack that is taking so much time. The type of cache employed here could be write-through, so the backing memory would be updated every time this value changes.

Does it need to update val every iteration? No. But it is not wrong to do so.

*** Update ***

Others have correctly identified a foundational flaw in my explanation above that renders it utterly and completely wrong. Please disregard it. :)

– Sparky
  • I think the random number generated is not the same each loop. For example, running `data_t seed = 42; printf("%d\n", rand_r(&seed)); printf("%d\n", rand_r(&seed));` prints out two different values – Steven Mai Jul 26 '23 at 18:21
  • The re-entrant function `rand_r(unsigned int *seed)` is a PRNG that uses user-supplied storage for the seed, rather than an internal global with locking like `rand`. https://man7.org/linux/man-pages/man3/srand.3.html . It's definitely not loop-invariant. Also, this looks like a debug build, so it wouldn't get hoisted even if it could have been. And the only x86-64 to use write-through L1d caches was Bulldozer-family, but even that had a 4K write-combining buffer between it and the write-back L2 (so memset performance fell off at a 4K size instead of at L1d size). – Peter Cordes Jul 26 '23 at 18:29
  • It's theoretically possible that someone could boot Linux and set the MTRRs or PAT attributes to make normal memory write-through instead of write-back cacheable, requiring write-through all the way to DRAM, but I *hope* they'd have mentioned that because it would take some serious work! Also, this is a debug build with other stores like `mov %eax,-0x14(%rbp)` after the counter increment, but that didn't get any counts for the `cycles` event. (Don't benchmark debug builds, @StevenMai; a sketch of an optimized build follows these comments.) – Peter Cordes Jul 26 '23 at 18:29
  • I appreciate the identification of the flaw in my logic. It helps me be a little better in the future. :) – Sparky Jul 27 '23 at 14:26
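
Following up on the "don't benchmark debug builds" point above, a typical optimized build would look something like this (a sketch with standard gcc flags; `sum.c` is an assumed filename for the program in the question):

  gcc -O2 -Wall sum.c -o sum
  perf record ./sum
  perf report

With optimization enabled, `val` and `i` should stay in registers across the loop, so the stack stores that dominate the debug-build annotation above disappear.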