C++ Cache performance odd behavior

Question

I read an article (1.5 years old http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736) which talks about cache performance and size of data. They show the following code which they say they ran on an i7 (sandy bridge)

static volatile int array[Size];
static void test_function(void)
{
    for (int i = 0; i < Iterations; i++)
        for (int x = 0; x < Size; x++)
          array[x]++;
}

They make the claim that if they keep Size*Iterations constant, increasing Size, when the size in memory of array increases beyond the L2 cache size they observe a huge spike in time taken to execute (10x).

As an exercise for myself I wanted to try this to see if I could reproduce their results for my machine . (i7 3770k, win7, visual c++ 2012 compiler, Win32 debug mode, no optimizations enabled). To my surprise though, I am not able to see an increase in time taken to execute (even beyond the L3 cache size) which made me think the compiler was somehow optimizing this code. But I dont see any optimizations either. The only change in speed i see is that below the word size of my machine it takes slightly longer. Below are my timings, code listing, and pertinent disassembly.

Does anyone know why:

1) Why the time taken does not increase at all regardless of the size of my array? Or how I could find out?

2) Why does the time taken start high and then decrease until the cache line size is reached, shouldn't more iterations be processed without reading from cache if the data is less than the line size?

Timings:

Size=1,Iterations=1073741824, Time=3829
Size=2,Iterations=536870912, Time=2625
Size=4,Iterations=268435456, Time=2563
Size=16,Iterations=67108864, Time=2906
Size=32,Iterations=33554432, Time=3469
Size=64,Iterations=16777216, Time=3250
Size=256,Iterations=4194304, Time=3140
Size=1024,Iterations=1048576, Time=3110
Size=2048,Iterations=524288, Time=3187
Size=4096,Iterations=262144, Time=3078
Size=8192,Iterations=131072, Time=3125
Size=16384,Iterations=65536, Time=3109
Size=32768,Iterations=32768, Time=3078
Size=65536,Iterations=16384, Time=3078
Size=262144,Iterations=4096, Time=3172
Size=524288,Iterations=2048, Time=3109
Size=1048576,Iterations=1024, Time=3094
Size=2097152,Iterations=512, Time=3313
Size=4194304,Iterations=256, Time=3391
Size=8388608,Iterations=128, Time=3312
Size=33554432,Iterations=32, Time=3109
Size=134217728,Iterations=8, Time=3515
Size=536870912,Iterations=2, Time=3532

code:

#include <string>
#include <cassert>
#include <windows.h>

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_body(volatile char* array)
{
     for (unsigned int i = 0; i < ITERATIONS; i++)
    {
        for (unsigned int  x = 0; x < SIZE; x++)
        {
            array[x]++;
        }
    }
}

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
    assert(SIZE*ITERATIONS == 1024*1024*1024);
    static volatile char array[SIZE];

    test_body<SIZE, 1>(array); //warmup

    DWORD beginTime = GetTickCount();
    test_body<SIZE, ITERATIONS>(array); 
    DWORD endTime= GetTickCount();
    printf("Size=%u,Iterations=%u, Time=%d\n", SIZE,ITERATIONS, endTime-beginTime);
}

int main()
{
    enum { eIterations= 1024*1024*1024};
    test_function<1, eIterations>();
    test_function<2, eIterations/2>();
    test_function<4, eIterations/4>();
    test_function<16, eIterations/16>();
    test_function<32, eIterations/ 32>();
    test_function<64, eIterations/ 64>();
    test_function<256, eIterations/ 256>();
    test_function<1024, eIterations/ 1024>();
    test_function<2048, eIterations/ 2048>();
    test_function<4096, eIterations/ 4096>();
    test_function<8192, eIterations/ 8192>();
    test_function<16384, eIterations/ 16384>();
    test_function<32768, eIterations/ 32768>();
    test_function<65536, eIterations/ 65536>();
    test_function<262144, eIterations/ 262144>();
    test_function<524288, eIterations/ 524288>();
    test_function<1048576, eIterations/ 1048576>();
    test_function<2097152, eIterations/ 2097152>();
    test_function<4194304, eIterations/ 4194304>();
    test_function<8388608, eIterations/ 8388608>();
    test_function<33554432, eIterations/ 33554432>();
    test_function<134217728, eIterations/ 134217728>();
    test_function<536870912, eIterations/ 536870912>();
}

Disassembly

    for (unsigned int i = 0; i < ITERATIONS; i++)
00281A59  mov         dword ptr [ebp-4],0  
00281A60  jmp         test_body<536870912,2>+1Bh (0281A6Bh)  
00281A62  mov         eax,dword ptr [ebp-4]  
00281A65  add         eax,1  
00281A68  mov         dword ptr [ebp-4],eax  
00281A6B  cmp         dword ptr [ebp-4],2  
00281A6F  jae         test_body<536870912,2>+53h (0281AA3h)  
    {
        for (unsigned int  x = 0; x < SIZE; x++)
00281A71  mov         dword ptr [ebp-8],0  
00281A78  jmp         test_body<536870912,2>+33h (0281A83h)  
00281A7A  mov         eax,dword ptr [ebp-8]  
    {
        for (unsigned int  x = 0; x < SIZE; x++)
00281A7D  add         eax,1  
00281A80  mov         dword ptr [ebp-8],eax  
00281A83  cmp         dword ptr [ebp-8],20000000h  
00281A8A  jae         test_body<536870912,2>+51h (0281AA1h)  
        {
            array[x]++;
00281A8C  mov         eax,dword ptr [array]  
00281A8F  add         eax,dword ptr [ebp-8]  
00281A92  mov         cl,byte ptr [eax]  
00281A94  add         cl,1  
00281A97  mov         edx,dword ptr [array]  
00281A9A  add         edx,dword ptr [ebp-8]  
00281A9D  mov         byte ptr [edx],cl  
        }
00281A9F  jmp         test_body<536870912,2>+2Ah (0281A7Ah)  
    }
00281AA1  jmp         test_body<536870912,2>+12h (0281A62h)

skimon, please update your code to do full single iteration before starting timing for each size. (warmup, like in Joky's answer). — osgx, Apr 27 '14 at 04:01
skimon, if you want to be .. to measure cache sizes and latencies, use classic cache test http://stackoverflow.com/a/21542939/196561 - walk on linked list, when the stride (distance in memory between linked elements) is changed. This is the test lat_mem_rd of lmbench, also used by CPU-Z. — osgx, Apr 27 '14 at 04:08
@osgx ok update the code to do warmup and reposted timings and disassembly. — skimon, Apr 27 '14 at 20:30
@osgx i ran cpu-z to check the cache sizes before posting my original question. I am not really concerned with the timings, rather to understand the 2 points I mention. Thanks — skimon, Apr 27 '14 at 20:33
skimon, thank you. There is hardware prefetch engines near L1 and L2 in Intel chips, each can detect linear access patterns and load data early. But the total time (if it is ms=milliseconds) is too high (speed of access is near memory speed even for smaller cache). I will try to explore hw perfcounters. Can you publish several windows binaries each for only one size (32 1024 32768 1048576 33554432)? — osgx, Apr 27 '14 at 20:36
@osgx ok i will provide binaries shortly. It is milliseconds , but there are 10^9 iterations. — skimon, Apr 28 '14 at 14:22
@osgx here are the binaries http://www.fileswap.com/dl/Wk8apyYiq/ — skimon, Apr 29 '14 at 02:30

score 8 · Accepted Answer · edited May 23 '17 at 10:30

TL;DR: Your test is not correct test for cache latency or speed. Instead it measures some problems of chopping complex code through OoO CPU pipeline.

Use right tests for measuring cache and memory latency: lat_mem_rd from lmbench; and right tests for speed (bandwidth) measurements: STREAM benchmark for memory speed; tests from memtest86 for cache speed with rep movsl main operation)

Also, in modern (2010 and newer) desktop/sever CPUs there is hardware prefetch logic built in near L1 and L2 caches which will detect linear access pattern and preload data from outer caches into inner before you will ask for this data: Intel Optimization Manual - 7.2 Hardware prefetching of data, page 365; intel.com blog, 2009. It is hard to disable all hardware prefetches (SO Q/A 1, SO Q/A 2)

Long story:

I will try to do several measurements of similar test with perf performance monitoring tool in Linux (aka perf_events). The code is based on program from Joky (array of 32-bit ints, not of chars), and was separated into several binaries as: a5 is for size 2^5 = 32; a10 => 2^10 = 1024 (4 KB); a15 => 2^15 = 32768, a20 (1 million of ints = 4 MB) and a25 (32 millions of ints = 128MB). The cpu is i7-2600 quad-core Sandy Bridge 3.4 GHz.

Let's start with basic perf stat with default event set (some lines are skipped). I selected 2^10 (4 KB) and 2^20 (4 MB)

$ perf stat ./a10
Size=1024 ITERATIONS=1048576, TIME=2372.09 ms

 Performance counter stats for './a10':

               276 page-faults               #    0,000 M/sec
     8 238 473 169 cycles                    #    3,499 GHz
     4 936 244 310 stalled-cycles-frontend   #   59,92% frontend cycles idle
       415 849 629 stalled-cycles-backend    #    5,05% backend  cycles idle
    11 832 421 238 instructions              #    1,44  insns per cycle
                                             #    0,42  stalled cycles per insn
     1 078 974 782 branches                  #  458,274 M/sec
         1 080 091 branch-misses             #    0,10% of all branches

$ perf stat ./a20
Size=1048576 ITERATIONS=1024, TIME=2432.4 ms

 Performance counter stats for './a20':

             2 321 page-faults               #    0,001 M/sec
     8 487 656 735 cycles                    #    3,499 GHz
     5 184 295 720 stalled-cycles-frontend   #   61,08% frontend cycles idle
       663 245 253 stalled-cycles-backend    #    7,81% backend  cycles idle
    11 836 712 988 instructions              #    1,39  insns per cycle
                                             #    0,44  stalled cycles per insn
     1 077 257 745 branches                  #  444,104 M/sec
            30 601 branch-misses             #    0,00% of all branches

What we can see here? Instruction counts are very close (because Size*Iterations is constant), cycle count and time are close too. Both examples have 1 billion branches with 99% good prediction. But there is very high 60% stall count for frontend and 5-8% for backend. Frontend stalls are stalls in the instruction fetch and decode, it can be hard to tell why, but for your code frontend can't decode 4 instructions per tick (page B-41 of Intel optimisation manual, section B.3 - "Performance tuning techniques for ... Sandy Bridge", subsection B.3.2 Hierarchical Top-Down Performance Characterization ...)

$ perf record -e stalled-cycles-frontend ./a20
Size=1048576 ITERATIONS=1024, TIME=2477.65 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.097 MB perf.data (~4245 samples) ]
$ perf annotate -d a20|cat
 Percent |      Source code & Disassembly of a20
------------------------------------------------
         :      08048e6f <void test_body<1048576u, 1024u>(int volatile*)>:

   10.43 :       8048e87:       mov    -0x8(%ebp),%eax
    1.10 :       8048e8a:       lea    0x0(,%eax,4),%edx
    0.16 :       8048e91:       mov    0x8(%ebp),%eax
    0.78 :       8048e94:       add    %edx,%eax
    6.87 :       8048e96:       mov    (%eax),%edx
   52.53 :       8048e98:       add    $0x1,%edx
    9.89 :       8048e9b:       mov    %edx,(%eax)
   14.15 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    2.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    1.39 :       8048ea4:       cmp    $0xfffff,%eax

Or here with raw opcodes (objdump -d), some have rather complicated indexing, so possible they can't be handled by 3 simple decoders and waits for the only complex decoder (image is there: http://www.realworldtech.com/sandy-bridge/4/)

 8048e87:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048e8a:       8d 14 85 00 00 00 00    lea    0x0(,%eax,4),%edx
 8048e91:       8b 45 08                mov    0x8(%ebp),%eax
 8048e94:       01 d0                   add    %edx,%eax
 8048e96:       8b 10                   mov    (%eax),%edx
 8048e98:       83 c2 01                add    $0x1,%edx
 8048e9b:       89 10                   mov    %edx,(%eax)
 8048e9d:       83 45 f8 01             addl   $0x1,-0x8(%ebp)
 8048ea1:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048ea4:       3d ff ff 0f 00          cmp    $0xfffff,%eax

Backend stalls are stalls created by waiting for memory or cache (the thing you are interested in when measuring caches) and by internal execution core stalls:

$ perf record -e stalled-cycles-backend ./a20
Size=1048576 ITERATIONS=1024, TIME=2480.09 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.095 MB perf.data (~4149 samples) ]
$ perf annotate -d a20|cat
    4.25 :       8048e96:       mov    (%eax),%edx
   58.68 :       8048e98:       add    $0x1,%edx
    8.86 :       8048e9b:       mov    %edx,(%eax)
    3.94 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    7.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    7.40 :       8048ea4:       cmp    $0xfffff,%eax

Most backend stalls are reported for add 0x1,%edx because it is the consumer of data, loaded from the array in previous command. With store to array they account for 70% of backend stalls, or if we multiply if for total backend stall portion in the program (7%), for the 5% of all stalls. Or in other words, the cache is faster than your program. Now we can answer to your first question:

Why the time taken does not increase at all regardless of the size of my array?

You test is so bad (not optimized), that you are trying to measure caches, but they have only 5% slowdown on total run time. Your test is so unstable (noisy) that you will not see this 5% effect.

With custom perf stat runs we also can measure cache request-to-miss ratio. For 4 KB program L1 data cache serves 99,99% of all loads and 99,999% of all stores. We can note that your incorrect test generate several times more requests to cache than it is needed to walk on array and increment every element (1 billion loads + 1 billion stores). Additional accesses are for working with local variables like x, they always served by cache because their address is constant)

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a10
Size=1024 ITERATIONS=1048576, TIME=2412.25 ms

 Performance counter stats for './a10':

     5 375 195 765 L1-dcache-loads
           364 140 L1-dcache-load-misses     #    0,01% of all L1-dcache hits
     2 151 408 053 L1-dcache-stores
            13 350 L1-dcache-store-misses

For 4 MB program hit rate is many-many times worse. 100 times more misses! Now 1.2 % of all memory requests are served not by L1 but L2.

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a20
Size=1048576 ITERATIONS=1024, TIME=2443.92 ms

 Performance counter stats for './a20':

     5 378 035 007 L1-dcache-loads
        67 725 008 L1-dcache-load-misses     #    1,26% of all L1-dcache hits
     2 152 183 588 L1-dcache-stores
        67 266 426 L1-dcache-store-misses

Isn't it a case when we want to notice how cache latency goes from 4 cpu ticks up to 12 (3 times longer), and when this change affects only 1.2% of cache requests, and when our program has only 7% slowdown sensitive to the cache latencies ???

What if we will use bigger data set? Ok, here is a25 (2^25 of 4-byte ints = 128 MB, several times more than cache size):

$ perf stat   ./a25
Size=134217728 ITERATIONS=8, TIME=2437.25 ms

 Performance counter stats for './a25':

           262 417 page-faults               #    0,090 M/sec
    10 214 588 827 cycles                    #    3,499 GHz
     6 272 114 853 stalled-cycles-frontend   #   61,40% frontend cycles idle
     1 098 632 880 stalled-cycles-backend    #   10,76% backend  cycles idle
    13 683 671 982 instructions              #    1,34  insns per cycle
                                             #    0,46  stalled cycles per insn
     1 274 410 549 branches                  #  436,519 M/sec
           315 656 branch-misses             #    0,02% of all branches

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2444.13 ms

 Performance counter stats for './a25':

     6 138 410 226 L1-dcache-loads
        77 025 747 L1-dcache-load-misses     #    1,25% of all L1-dcache hits
     2 515 141 824 L1-dcache-stores
        76 320 695 L1-dcache-store-misses

Almost the same L1 miss rate, and more backend stalls. I was able to get stats on "cache-references,cache-misses" events ans I suggest they are about L3 cache (there is several times more requests to L2):

$ perf stat -e 'cache-references,cache-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2440.71 ms

 Performance counter stats for './a25':

        17 053 482 cache-references
        11 829 118 cache-misses              #   69,365 % of all cache refs

So, miss rate is high, but the test does 1 billion of (useful) loads, and only 0.08 billion of them misses L1. 0.01 billion of requests are served by memory. Memory latency is around 230 cpu clocks instead of 4 clock L1 latency. Is the test able to see this? May be, if the noise is low.

Awesome answer thank you! I have read through a couple of times and now and im rereading to digest completely but I do understand the gist, that my program is dominated by stalling for instructions to be decoded rather than cache issues so it is not possible to accurately measure the cache performance. Thank you also for posting all the steps in your tests and the commands as now I have some tools to repeat the test myself and play with it. — skimon, Apr 29 '14 at 14:15
I may have more questions later, but for now what is bugging me is: how did the original author do this test (in link) in original post, or is there not enough code there to know? Is it entirely dependent on what compiler/optimizations they used? I am partly annoyed they didnt specify exactly how to reproduce the issue, but im also happy they didnt because now I got to learn from your post! — skimon, Apr 29 '14 at 14:18
Original author is Chris Gottbrath, Product Manager in Rogue Wave Software, ex Entus (https://twitter.com/Chris_HPC). I'm almost sure that he is experienced technomage and used advanced compilers (and most advanced commercial cache profiler, ThreadSpotter, which is managed by him) for the test to make it to depend mostly on cache speed. The DrDobb's give little detail about the test and CPU used to run test. Only info: "L2 is 512 KB", "50 seconds is array is >512KB, 5 seconds for array < 512KB", "10x speed up, is consistent with the ratio of latency to the cache and latency to main memory"... — osgx, Apr 29 '14 at 16:17
But the 10:1 ratio between cache and memory latency is not what he lists above the "simple routine" source code. He lists Sandy Bridge (Intel Core i7), which has L3 between L2 and memory, its L3 latency is 11ns, mem lat is 60-70ns; also listed is AMD's 8-core Bulldozer (8150) again with L3 cache, 8 ns for L2, 24 ns for L3 and 60-70 ns for memory. No system with 512KB L2, without L3 and 10x slower memory.. I think the test was run on third unlisted system, which can be rather old or even on XeonPhi(???), possibly to get shiny "10x speed up".... I will try to recreate test with gcc -O3. — osgx, Apr 29 '14 at 16:25
Data for my test on single core Pentium 4 2.5GHz, 512KB L2: a5 5.7 sec, a10 5.7sec, a15 6.8sec, a20 7sec, a25 7sec. Cache effects are visible even with your test. — osgx, Apr 29 '14 at 18:22
I didn't read all the details, but the main reason why this is a bad test is that it doesn't measure cache read BW, but intermixes read/write bandwidths. Doing `array[x]++` introduces a read-modify-write sequence which will make the line dirty in the cache and force you to evict it all the way to the memory, taking up expensive cache/bus Bandwidth. @skimon, why not just read the values (and sum them in some accumulator to avoid optimizing them out) — Leeor, Apr 29 '14 at 20:28
Leeor, thank you for all your answers here, at SO. The problem with the test is that is is not a test of something (not measuring anything specific). It was just first wrong try to reproduce the code from linked DrDobbs article. RMW can be bad and it will consume available BW, but it has very little effect on this test because the BW, generated by test is too low: (for biggest tested a25 with array of 128MB) 1 billion of 4byte linear stores per second = 4GB per second (or 64 millions lines evicted per second). i7-2600 has 20 GB/s of memory BW, so evictions are not limiting factor for the test. — osgx, Apr 29 '14 at 21:40
You're looking at memory BW, there's also a limitation on the core boundary (the CPU/DRAM BW is designed to sustain several cores, not to mention other agents using it), but I totally agree that this isn't the biggest problem in this test (already +1'ed you :). Still, it's not that far from measuring true BW, and it could be a good exercise for the OP to make it do that. — Leeor, Apr 29 '14 at 22:00
Anyway, I don't think the frontend stalls are due to decoders. I'm no expert, but that should hardly ever be the problem as only very special cases require decoder0. Besides this CPU has a uop cache and loop stream detector to feed the backend efficiently. For starters i'd get rid of these template functions as they mess up the code size and cacheability — Leeor, Apr 29 '14 at 22:02
@osgx I was trying to reproduce your results but depending on which machine i use either both or one of the stalled-cycles counters are '' on my cpu. Do you know if this feature is simply removed from later model cpus ? ( I have a i7-4500U and i7-3770k ). 3770k seems to support front end stalled but not backend, the 4500U doesnt support either. I found an article on SO (http://stackoverflow.com/questions/22712956/why-does-perf-stat-show-stalled-cycles-backend-as-not-supported) but there doesnt seem to be a conclusive answer on this.(i tried their suggestions) — skimon, Apr 30 '14 at 14:59
@Leeor array[x]++ is what is prescribed by the article (granted, by someone currently a lot more knowledgeable in cache performance). I will try some things and try to post a better version of the code later that shows what he was originally talking about. As for the templates, im not sure that should make a difference since each template becomes its own function which should fit nicely in an instruction cache? Though I did try with simply 1 version at a time and i get the same behavior so i suspect it is related to what osgx said. — skimon, Apr 30 '14 at 15:05
skimon, the support of hw events depends both on CPU and Linux Kernel version. Use modern kernel or older CPUs. And with Sandy Bridge/Ivy Bridge or Haswell, you can use `ocperf` wrapper over `perf` from pmu-tools - https://github.com/andikleen/pmu-tools - https://lwn.net/Articles/556983/ The `ocperf` understands all intel event names from Intel's optimization manual http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf and there is also `toplev.py` script to investigate performance. `ocperf` will also support every HW event with older linux kernels. — osgx, Apr 30 '14 at 17:11
@osgx thanks for sharing the detail analysis. I wonder whether you have the source code for `a10` and `a20` you profiled (you only mentioned the code was from a member here). Thanks! — HCSF, Jan 10 '20 at 01:14
@HCSF, I think that it was code based on Joky's answer here: https://stackoverflow.com/a/23283006/196561. — osgx, Jan 11 '20 at 19:34

score 1 · Answer 2 · answered Apr 25 '14 at 02:04

Some results (OSX, Sandy Bridge):

GCC -O0

Size=1 ITERATIONS=1073741824, TIME=2416.06 ms
Size=2 ITERATIONS=536870912, TIME=1885.46 ms
Size=4 ITERATIONS=268435456, TIME=1782.92 ms
Size=16 ITERATIONS=67108864, TIME=2023.71 ms
Size=32 ITERATIONS=33554432, TIME=2184.99 ms
Size=64 ITERATIONS=16777216, TIME=2464.09 ms
Size=256 ITERATIONS=4194304, TIME=2358.31 ms
Size=1024 ITERATIONS=1048576, TIME=2333.77 ms
Size=2048 ITERATIONS=524288, TIME=2340.16 ms
Size=4096 ITERATIONS=262144, TIME=2349.97 ms
Size=8192 ITERATIONS=131072, TIME=2346.96 ms
Size=16384 ITERATIONS=65536, TIME=2350.3 ms
Size=32768 ITERATIONS=32768, TIME=2348.71 ms
Size=65536 ITERATIONS=16384, TIME=2355.28 ms
Size=262144 ITERATIONS=4096, TIME=2358.97 ms
Size=524288 ITERATIONS=2048, TIME=2476.46 ms
Size=1048576 ITERATIONS=1024, TIME=2429.07 ms
Size=2097152 ITERATIONS=512, TIME=2427.09 ms
Size=4194304 ITERATIONS=256, TIME=2443.42 ms
Size=8388608 ITERATIONS=128, TIME=2435.54 ms
Size=33554432 ITERATIONS=32, TIME=2389.08 ms
Size=134217728 ITERATIONS=8, TIME=2444.43 ms
Size=536870912 ITERATIONS=2, TIME=2600.91 ms

GCC -O3

Size=1 ITERATIONS=1073741824, TIME=2197.12 ms
Size=2 ITERATIONS=536870912, TIME=996.409 ms
Size=4 ITERATIONS=268435456, TIME=606.252 ms
Size=16 ITERATIONS=67108864, TIME=306.904 ms
Size=32 ITERATIONS=33554432, TIME=897.692 ms
Size=64 ITERATIONS=16777216, TIME=847.794 ms
Size=256 ITERATIONS=4194304, TIME=802.136 ms
Size=1024 ITERATIONS=1048576, TIME=761.971 ms
Size=2048 ITERATIONS=524288, TIME=760.136 ms
Size=4096 ITERATIONS=262144, TIME=759.149 ms
Size=8192 ITERATIONS=131072, TIME=749.881 ms
Size=16384 ITERATIONS=65536, TIME=756.672 ms
Size=32768 ITERATIONS=32768, TIME=759.565 ms
Size=65536 ITERATIONS=16384, TIME=754.81 ms
Size=262144 ITERATIONS=4096, TIME=745.899 ms
Size=524288 ITERATIONS=2048, TIME=749.527 ms
Size=1048576 ITERATIONS=1024, TIME=758.009 ms
Size=2097152 ITERATIONS=512, TIME=776.671 ms
Size=4194304 ITERATIONS=256, TIME=778.963 ms
Size=8388608 ITERATIONS=128, TIME=783.191 ms
Size=33554432 ITERATIONS=32, TIME=770.603 ms
Size=134217728 ITERATIONS=8, TIME=785.703 ms
Size=536870912 ITERATIONS=2, TIME=911.875 ms

(Note how the first one is really slower, I feel like there may be a mis-speculation somewhere around load-store forwarding...)

Interestingly turning the optimizations on and removing the volatile shows a somehow nicer curve:

Size=1 ITERATIONS=1073741824, TIME=0 ms
Size=2 ITERATIONS=536870912, TIME=0 ms
Size=4 ITERATIONS=268435456, TIME=0 ms
Size=16 ITERATIONS=67108864, TIME=0.001 ms
Size=32 ITERATIONS=33554432, TIME=125.581 ms
Size=64 ITERATIONS=16777216, TIME=140.654 ms
Size=256 ITERATIONS=4194304, TIME=217.559 ms
Size=1024 ITERATIONS=1048576, TIME=168.155 ms
Size=2048 ITERATIONS=524288, TIME=159.031 ms
Size=4096 ITERATIONS=262144, TIME=154.373 ms
Size=8192 ITERATIONS=131072, TIME=153.858 ms
Size=16384 ITERATIONS=65536, TIME=156.819 ms
Size=32768 ITERATIONS=32768, TIME=156.505 ms
Size=65536 ITERATIONS=16384, TIME=156.921 ms
Size=262144 ITERATIONS=4096, TIME=215.911 ms
Size=524288 ITERATIONS=2048, TIME=220.298 ms
Size=1048576 ITERATIONS=1024, TIME=235.648 ms
Size=2097152 ITERATIONS=512, TIME=320.284 ms
Size=4194304 ITERATIONS=256, TIME=409.433 ms
Size=8388608 ITERATIONS=128, TIME=431.743 ms
Size=33554432 ITERATIONS=32, TIME=429.436 ms
Size=134217728 ITERATIONS=8, TIME=430.052 ms
Size=536870912 ITERATIONS=2, TIME=535.773 ms

To help anyone reproduce the "issue", here is some standard (I hope) C++ code:

#include <string>
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <memory>

template <unsigned int SIZE, unsigned int ITERATIONS>
void test_body(volatile int *array) {
    for (int i = 0; i < ITERATIONS; i++)
    {
        for (int  x = 0; x < SIZE; x++)
        {
            array[x]++;
        }
    }

}


template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
    static_assert(SIZE*ITERATIONS == 1024*1024*1024, "SIZE MISMATCH");
    std::unique_ptr<volatile int[]> array { new int[SIZE] };

    // Warmup
    test_body<SIZE, 1>(array.get());

    auto start = std::chrono::steady_clock::now();

    test_body<SIZE, ITERATIONS>(array.get());

    auto end = std::chrono::steady_clock::now();
    auto diff = end - start;
    std::cout << "Size=" << SIZE << " ITERATIONS=" << ITERATIONS << ", TIME=" << std::chrono::duration <double, std::milli> (diff).count() << " ms" << std::endl;
}

int main()
{
    enum { eIterations= 1024*1024*1024};
    test_function<1, eIterations>();
    test_function<2, eIterations/2>();
    test_function<4, eIterations/4>();
    test_function<16, eIterations/16>();
    test_function<32, eIterations/ 32>();
    test_function<64, eIterations/ 64>();
    test_function<256, eIterations/ 256>();
    test_function<1024, eIterations/ 1024>();
    test_function<2048, eIterations/ 2048>();
    test_function<4096, eIterations/ 4096>();
    test_function<8192, eIterations/ 8192>();
    test_function<16384, eIterations/ 16384>();
    test_function<32768, eIterations/ 32768>();
    test_function<65536, eIterations/ 65536>();
    test_function<262144, eIterations/ 262144>();
    test_function<524288, eIterations/ 524288>();
    test_function<1048576, eIterations/ 1048576>();
    test_function<2097152, eIterations/ 2097152>();
    test_function<4194304, eIterations/ 4194304>();
    test_function<8388608, eIterations/ 8388608>();
    test_function<33554432, eIterations/ 33554432>();
    test_function<134217728, eIterations/ 134217728>();
    test_function<536870912, eIterations/ 536870912>();
}

Thanks for doing this. I suspect if you enable optimizations and remove the volatile though I would expect the compiler optimize out the array entirely since its not used? — skimon, Apr 25 '14 at 02:10
Compiler is not smart enough :) However it save explicit flushes and allows to reorder instructions. — Joky, Apr 25 '14 at 02:11
It also vectorize, and mess-up multiple thing. I just put it here for information, it not representative... — Joky, Apr 25 '14 at 02:25

Roy Longbottom · Answer 3 · 2014-04-29T20:07:37.213

It seems clear that constant time implies a constant instruction execution rate. To measure cache/RAM speed, data transfer instructions should predominate and results require further clarification than run time, like MB/second and instructions per second. You need something like my BusSpeed benchmark (Google for Roy BusSpeed benchmark or BusSpd2k for source codes and results with versions for Windows, Linux and Android). The original used assembly code with instructions like:

   "add     edx,ecx"     \
   "mov     ebx,[edi]"   \
   "mov     ecx,ebx"     \
"lp: and     ebx,[edx]"   \
   "and     ecx,[edx+4]"   \
   "and     ebx,[edx+8]"   \
   "and     ecx,[edx+12]"   \
   "and     ebx,[edx+16]"   \
   "and     ecx,[edx+20]"   \
   "and     ebx,[edx+24]"   \
   "and     ecx,[edx+28]"   \
   "and     ebx,[edx+32]"   \
   "and     ecx,[edx+36]"   \
   "and     ebx,[edx+40]"   \

 To

   "and     ecx,[edx+236]"   \
   "and     ebx,[edx+240]"   \
   "and     ecx,[edx+244]"   \
   "and     ebx,[edx+248]"   \
   "and     ecx,[edx+252]"   \
   "add     edx,256"     \
   "dec     eax"         \
   "jnz     lp"          \
   "and     ebx,ecx"     \
   "mov     [edi],ebx"     \

Later versions used C as follows

void inc1word()
{
   int i, j;

   for(j=0; j<passes1; j++)
   {
       for (i=0; i<wordsToTest; i=i+64)
       {
           andsum1 = andsum1 & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
                             & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
                             & array[i+8 ] & array[i+9 ] & array[i+10] & array[i+11]
                             & array[i+12] & array[i+13] & array[i+14] & array[i+15]
                             & array[i+16] & array[i+17] & array[i+18] & array[i+19]
                             & array[i+20] & array[i+21] & array[i+22] & array[i+23]
                             & array[i+24] & array[i+25] & array[i+26] & array[i+27]
                             & array[i+28] & array[i+29] & array[i+30] & array[i+31]
                             & array[i+32] & array[i+33] & array[i+34] & array[i+35]
                             & array[i+36] & array[i+37] & array[i+38] & array[i+39]
                             & array[i+40] & array[i+41] & array[i+42] & array[i+43]
                             & array[i+44] & array[i+45] & array[i+46] & array[i+47]
                             & array[i+48] & array[i+49] & array[i+50] & array[i+51]
                             & array[i+52] & array[i+53] & array[i+54] & array[i+55]
                             & array[i+56] & array[i+57] & array[i+58] & array[i+59]
                             & array[i+60] & array[i+61] & array[i+62] & array[i+63];
       }
   }
}

The benchmark measures MB/second of caches and RAM, including skipped sequential addressing to see where data is read in bursts. Example results follow. Note burst reading effects and reading to two different registers (Reg2, from assembly code version) can be faster than to 1. Then, in this case, loading every word to 1 register (AndI, Reg1, Inc4 bytes) produces almost constant speeds (around 1400 MIPS). So, even a long sequence of instructions might not suit particular pipelines). The way to find out is to run a wider variation of your tests.

######################################################################### Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz Measured 2807 MHz

         Windows Bus Speed Test Version 2.2 by Roy Longbottom

  Minimum      0.100 seconds per test, Start Fri Jul 30 16:43:56 2010

          MovI  MovI  MovI  MovI  MovI  MovI  AndI  AndI  MovM  MovM
  Memory  Reg2  Reg2  Reg2  Reg2  Reg1  Reg2  Reg1  Reg2  Reg1  Reg8
  KBytes Inc64 Inc32 Inc16  Inc8  Inc4  Inc4  Inc4  Inc4  Inc8  Inc8
   Used   MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S  MB/S

      4  10025 10800 11262 11498 11612 11634  5850 11635 23093 23090
      8  10807 11267 11505 11627 11694 11694  5871 11694 23299 23297
     16  11251 11488 11620 11614 11712 11719  5873 11718 23391 23398
     32   9893  9853 10890 11170 11558 11492  5872 11466 21032 21025
     64   3219  4620  7289  9479 10805 10805  5875 10797 14426 14426
    128   3213  4805  7305  9467 10811 10810  5875 10805 14442 14408
    256   3144  4592  7231  9445 10759 10733  5870 10743 14336 14337
    512   2005  3497  5980  9056 10466 10467  5871 10441 13906 13905
   1024   2003  3482  5974  9017 10468 10466  5874 10467 13896 13818
   2048   2004  3497  5958  9088 10447 10448  5870 10447 13857 13857
   4096   1963  3398  5778  8870 10328 10328  5851 10328 13591 13630
   8192   1729  3045  5322  8270  9977  9963  5728  9965 12923 12892
  16384    692  1402  2495  4593  7811  7782  5406  7848  8335  8337
  32768    695  1406  2492  4584  7820  7826  5401  7792  8317  8322
  65536    695  1414  2488  4584  7823  7826  5403  7800  8321  8321
 131072    696  1402  2491  4575  7827  7824  5411  7846  8322  8323
 262144    696  1413  2498  4594  7791  7826  5409  7829  8333  8334
 524288    693  1416  2498  4595  7841  7842  5411  7847  8319  8285
1048576    704  1415  2478  4591  7845  7840  5410  7853  8290  8283

                  End of test Fri Jul 30 16:44:29 2010

MM uses 1 and 8 MMX registers, later versions use SSE

Source codes and execution files are free for anyone to play with. Files are in following where array declarations are shown:

Windows http://www.roylongbottom.org.uk/busspd2k.zip

 xx = (int *)VirtualAlloc(NULL, useMemK*1024+256, MEM_COMMIT, PAGE_READWRITE);

Linux http://www.roylongbottom.org.uk/memory_benchmarks.tar.gz

#ifdef Bits64
   array = (long long *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);
#else
   array = (int *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);

Results and other links (MP version, Android) are in:

http://www.roylongbottom.org.uk/busspd2k%20results.htm

Roy, thank you for code of the main loop. How did you declare the `array` variable? Is your benchmark opensourced or not? — osgx, Apr 29 '14 at 13:52

score 0 · Answer 4 · answered Apr 29 '14 at 12:22

I don't get constant time. I modified your code a bit to make it simpler. My times are much lower than yours. I'm not sure why. The large times at the beginning make sense because there are only a few values to write to so it's a dependency chain. The L2 Cache ends at 256k/4=64k. Notice how the values start rising between size=32768 and 65536.

//GCC -O3 Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
Size=1, Iterations=1073741824, Time=187.18 ms
Size=2, Iterations=536870912, Time=113.47 ms
Size=4, Iterations=268435456, Time=50.53 ms
Size=8, Iterations=134217728, Time=25.02 ms
Size=16, Iterations=67108864, Time=25.61 ms
Size=32, Iterations=33554432, Time=24.08 ms
Size=64, Iterations=16777216, Time=22.69 ms
Size=128, Iterations=8388608, Time=22.03 ms
Size=256, Iterations=4194304, Time=19.98 ms
Size=512, Iterations=2097152, Time=17.09 ms
Size=1024, Iterations=1048576, Time=15.66 ms
Size=2048, Iterations=524288, Time=14.94 ms
Size=4096, Iterations=262144, Time=14.58 ms
Size=8192, Iterations=131072, Time=14.40 ms
Size=16384, Iterations=65536, Time=14.63 ms
Size=32768, Iterations=32768, Time=14.75 ms
Size=65536, Iterations=16384, Time=18.58 ms
Size=131072, Iterations=8192, Time=20.51 ms
Size=262144, Iterations=4096, Time=21.18 ms
Size=524288, Iterations=2048, Time=21.26 ms
Size=1048576, Iterations=1024, Time=21.22 ms
Size=2097152, Iterations=512, Time=22.17 ms
Size=4194304, Iterations=256, Time=38.01 ms
Size=8388608, Iterations=128, Time=38.63 ms
Size=16777216, Iterations=64, Time=38.09 ms
Size=33554432, Iterations=32, Time=38.54 ms
Size=67108864, Iterations=16, Time=39.11 ms
Size=134217728, Iterations=8, Time=39.96 ms
Size=268435456, Iterations=4, Time=42.15 ms
Size=536870912, Iterations=2, Time=46.39 ms

The code:

#include <stdio.h>
#include <omp.h>

static void test_function(int n, int iterations)
{
    int *array = new int[n];
    for (int i = 0; i < iterations; i++)
        for (int x = 0; x < n; x++)
          array[x]++;
    delete[] array;
}

int main() {        
    for(int i=0, n=1, iterations=1073741824; i<30; i++, n*=2, iterations/=2) {
        double dtime;
        dtime = omp_get_wtime();
        test_function(n, iterations);
        dtime = omp_get_wtime() - dtime;
        printf("Size=%d, Iterations=%d, Time=%.3f\n", n, iterations, dtime);
    }
}

Z boson, you changed type of array, declaring it without volatile. With optimization enabled, compiler may optimize loop in test_function into something like `for (int x = 0; x < n; x++) array[x] += iterations`. Please, check you assembler code or post it here. — osgx, May 02 '14 at 01:34

C++ Cache performance odd behavior

4 Answers4

GCC -O0

GCC -O3

Linked