I've noticed an interesting behavior in the code from this question, which also comes from Agner Fog's Optimizing software in C++. It comes down to how data is accessed and stored in the cache (cache associativity). The explanation is clear to me, but then someone brought up volatile...
That is, if we add the volatile qualifier to the matrix declaration, volatile int mat[MATSIZE][MATSIZE];, the running time for the value 512 decreases dramatically: 2144 → 1562 μs.
As we know, volatile prevents the compiler from caching the value (in a CPU register) and from optimizing away accesses to that value when they seem unnecessary from the point of view of the program.
One possible explanation is that, with volatile, the computation happens only in RAM and no CPU caches are used. But on the other hand, the run time for the value 513 is again lower than for 512: 1490 μs...
