How can compiler optimizations affect on a cache-unfriendly algorithm?

Question

I've noticed an interesting behavior in the code of this question which is also comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache (cache associativity). The explanations is clear for me, but then someone pings about volatile... That is if we add volatile qualifier to the matrix declaration: volatile int mat[MATSIZE][MATSIZE]; the running time for value 512 dramatically decreases: 2144 → 1562 μs.

As we know volatile prevents compilers from caching the value (in a CPU register) and from optimizing away accesses to that value when they seem unnecessary from the POV of a program.

One possible version assumes that the computation process happens only in RAM and no cpu caches is used in the case of volatile. But on the other hand the run-time for value 513 again is less than for 512: 1490 μs...

Sounds like testing error/noise. Depending on your architecture and OS, different virtual memory can have different caching characteristics. Most particularly for tests like this. It's important that a good benchmark test the operations at a variety of virtual addresses to make sure it's not grossly distorted by a knotty patch of VM. That is to say; you allocate a lot of memory and then run your test at different offsets within that allocation. — sh1, Mar 10 '17 at 17:57
Pure speculation: the introduction of volatile could be generating more memory accesses influencing the cache replacement behavior (i.e., a cache line might be more recently accessed and so not evicted before its next use). — , Mar 11 '17 at 01:18

Ferenc Deak · Accepted Answer · 2017-03-11T21:28:35.110

Unfortunately I cannot confirm that the volatile version is running faster. I did a test run of both the volatile and the non-volatile version, and the time comparison of both can be seen in the chart below. When measuring performance in order for optimizing code it is always important to take several steps (not just one or two, but in the range of thousands) and take the mode (https://en.wikipedia.org/wiki/Mode_(statistics)) of the gathered data as per Alexandrescu's Fastware.

There are various peaks, and deep valleys, but looking at the chart you cannot conclude that the volatile is faster.

Indeed, the code generated by the compilers is different, but not to such an extent, you can check it out on https://godbolt.org/g/ILw3tg

How can compiler optimizations affect on a cache-unfriendly algorithm?

1 Answers1