Prefetching Examples?

Question

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:

It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.

That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.

score 71 · Accepted Answer · edited Oct 30 '12 at 22:55

Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticable speedup from prefetching.) This code performs a very large data transpose.

This example uses the SSE prefetch instructions, which may be the same as the one that GCC emits.

To run this example, you will need to compile this for x64 and have more than 4GB of memory. You can run it with a smaller datasize, but it will be too fast to time.

#include <iostream>
using std::cout;
using std::endl;

#include <emmintrin.h>
#include <malloc.h>
#include <time.h>
#include <string.h>

#define ENABLE_PREFETCH


#define f_vector    __m128d
#define i_ptr       size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
    //  To be super-optimized later.

    f_vector *stop = A + L;

    do{
        f_vector tmpA = *A;
        f_vector tmpB = *B;
        *A++ = tmpB;
        *B++ = tmpA;
    }while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
    //  Transposes T.
    //  T contains x columns and x rows.
    //  Each unit is of size (block * sizeof(f_vector)) bytes.

    //Conditions:
    //  - 0 < block
    //  - 1 < x

    i_ptr row_size = block * x;
    i_ptr iter_size = row_size + block;

    //  End of entire matrix.
    f_vector *stop_T = T + row_size * x;
    f_vector *end = stop_T - row_size;

    //  Iterate each row.
    f_vector *y_iter = T;
    do{
        //  Iterate each column.
        f_vector *ptr_x = y_iter + block;
        f_vector *ptr_y = y_iter + row_size;

        do{

#ifdef ENABLE_PREFETCH
            _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif

            swap_block(ptr_x,ptr_y,block);

            ptr_x += block;
            ptr_y += row_size;
        }while (ptr_y < stop_T);

        y_iter += iter_size;
    }while (y_iter < end);
}
int main(){

    i_ptr dimension = 4096;
    i_ptr block = 16;

    i_ptr words = block * dimension * dimension;
    i_ptr bytes = words * sizeof(f_vector);

    cout << "bytes = " << bytes << endl;
//    system("pause");

    f_vector *T = (f_vector*)_mm_malloc(bytes,16);
    if (T == NULL){
        cout << "Memory Allocation Failure" << endl;
        system("pause");
        exit(1);
    }
    memset(T,0,bytes);

    //  Perform in-place data transpose
    cout << "Starting Data Transpose...   ";
    clock_t start = clock();
    transpose_even(T,block,dimension);
    clock_t end = clock();

    cout << "Done" << endl;
    cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;

    _mm_free(T);
    system("pause");
}

When I run it with ENABLE_PREFETCH enabled, this is the output:

bytes = 4294967296
Starting Data Transpose...   Done
Time: 0.725 seconds
Press any key to continue . . .

When I run it with ENABLE_PREFETCH disabled, this is the output:

bytes = 4294967296
Starting Data Transpose...   Done
Time: 0.822 seconds
Press any key to continue . . .

So there's a 13% speedup from prefetching.

EDIT:

Here's some more results:

Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release

Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz
Prefetch   : 0.868
No Prefetch: 0.960

Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz
Prefetch   : 0.725
No Prefetch: 0.822

Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz
Prefetch   : 0.718
No Prefetch: 0.796

2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz
Prefetch   : 2.273
No Prefetch: 2.666

Interesting. Unfortunately on the two machines I tested (Macbook Pro with "Core 2 Duo" and a Linux machine with a "Quad-Core AMD Opteron Processor 2376") I didn't get a significant difference between the two versions. I suspect it has to do with cache size -- it looks you have a better machine than those two. What do you think? — Shaun Harker, Sep 07 '11 at 03:24
My machine is a Core i7 920 @ 3.5 GHz. 8MB L3 cache. This 10% speedup is more or less consistent on 3 other machines that I've tested: Core i7 2600K @ 4.6 GHz, and 2 x Xeon X5482 @ 3.2 GHz. But I'll admit that I've never tested it on a laptop or an AMD machine. — Mysticial, Sep 07 '11 at 03:39
I just edited my answer with the benchmarks on all 4 machines that I tested. They're all Intel desktops/workstations. So that could be the reason. I didn't test if your 3rd point holds. It could be that substituting it with a memory access could produce the same result. — Mysticial, Sep 07 '11 at 03:49
The third point is tricky to test due to the out-of-order execution. In order for the third point to hold, you will need to have some 100 - 200 instructions between the load to when it's actually used. A stalled load will block the pipeline after the re-order buffer is filled up. But a prefetch won't. The only time you'll see the penalty of the stalled load is when you actually have enough instructions to overflow the re-order buffer... If you just replace my prefetch with a normal load, the compiler will probably optimize out the load as dead code... (which satisfies your last point, lol) — Mysticial, Sep 09 '11 at 06:18
Yes, you'd have to add some sort of "dummy" thing to take in the memory access and then print its value so it wouldn't be optimized away -- that's what i have been doing. Can you give me a link to information regarding what you are discussing about stalled loads and re-order buffers? I think that might do me a world of good. — Shaun Harker, Sep 10 '11 at 05:07
That's a big topic. It took a semester course for me to learn it. Try googling around for "out-of-order execution" and "Tomasulo algorithm". That's just the tip of the iceberg, but it's a good start. — Mysticial, Sep 10 '11 at 06:51
All right then. I think that will answer my deeper questions, so I better give you the win on this. — Shaun Harker, Sep 11 '11 at 05:07
Just for the record: 0.59/0.67 seconds on my system with 1833 MHz memory. — Gunther Piez, Oct 30 '12 at 23:01

score 43 · Answer 2 · answered Jul 28 '15 at 22:18

Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.

In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).

 #include <time.h>
 #include <stdio.h>
 #include <stdlib.h>

 int binarySearch(int *array, int number_of_elements, int key) {
         int low = 0, high = number_of_elements-1, mid;
         while(low <= high) {
                 mid = (low + high)/2;
            #ifdef DO_PREFETCH
            // low path
            __builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
            // high path
            __builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
            #endif

                 if(array[mid] < key)
                         low = mid + 1; 
                 else if(array[mid] == key)
                         return mid;
                 else if(array[mid] > key)
                         high = mid-1;
         }
         return -1;
 }
 int main() {
     int SIZE = 1024*1024*512;
     int *array =  malloc(SIZE*sizeof(int));
     for (int i=0;i<SIZE;i++){
       array[i] = i;
     }
     int NUM_LOOKUPS = 1024*1024*8;
     srand(time(NULL));
     int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
     for (int i=0;i<NUM_LOOKUPS;i++){
       lookups[i] = rand() % SIZE;
     }
     for (int i=0;i<NUM_LOOKUPS;i++){
       int result = binarySearch(array, SIZE, lookups[i]);
     }
     free(array);
     free(lookups);
 }

When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:

 $ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
 $ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3

 $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch 

  Performance counter stats for './with-prefetch':

    356,675,702      L1-dcache-load-misses     #   41.39% of all L1-dcache hits  
   861,807,382      L1-dcache-loads                                             

   8.787467487 seconds time elapsed

 $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch 

 Performance counter stats for './no-prefetch':

   382,423,177      L1-dcache-load-misses     #   97.36% of all L1-dcache hits  
   392,799,791      L1-dcache-loads                                             

  11.376439030 seconds time elapsed

Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.

Can one combine builtin prefetch with AVX instructions if data is aligned ? The results of a prefetch could be loaded into a AVX register to do comparisons ? — gansub, Aug 11 '20 at 12:03
I ran some tests with this code and found that compiler optimisations actually made cache hit metrics worse and produced slower binaries. To see, compile and run code without any optimisation flags. — Mah35h, Aug 15 '21 at 05:47
When I run your program, I get the similarly scaled results, but when I run a Google Benchmark trying to isolate the search, I find prefetching to be 50% worse. — kolbe, Feb 16 '23 at 18:47

score 8 · Answer 3 · answered May 10 '18 at 19:21

I learned a lot from the excellent answers provided by @JamesScriven and @Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess somewhat artificial) example, where prefetching has a bigger impact (about factor 4 on my machine).

There are three possible bottle-necks for the modern architectures: CPU-speed, memory-band-width and memory latency. Prefetching is all about reducing the latency of the memory-accesses.

In a perfect scenario, where latency corresponds to X calculation-steps, we would have a oracle, which would tell us which memory we would access in X calculation-steps, the prefetching of this data would be launched and it would arrive just in-time X calculation-steps later.

For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.

That is the reason, why there is such a modest improvement for @Mystical's example: The prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much band-width is left - it could be becoming the limiting factor. I could see at best around 8% improvement on my machine.

The crucial insight from the @JamesScriven example: neither we nor the CPU knows the next access-address before the the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would lead to a look-forward and the hardware would be able to prefetch the data. However, because we can speculate about only one step there is not that much potential. I was not able to get more than 40% on my machine.

So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):

//making random accesses to memory:
unsigned int next(unsigned int current){
   return (current*10001+328)%SIZE;
}

//the actual work is happening here
void operator()(){

    //set up the oracle - let see it in the future oracle_offset steps
    unsigned int prefetch_index=0;
    for(int i=0;i<oracle_offset;i++)
        prefetch_index=next(prefetch_index);

    unsigned int index=0;
    for(int i=0;i<STEP_CNT;i++){
        //use oracle and prefetch memory block used in a future iteration
        if(prefetch){
            __builtin_prefetch(mem.data()+prefetch_index,0,1);    
        }

        //actual work, the less the better
        result+=mem[index];

        //prepare next iteration
        prefetch_index=next(prefetch_index);  #update oracle
        index=next(mem[index]);               #dependency on `mem[index]` is VERY important to prevent hardware from predicting future
    }
}

Some remarks:

data is prepared in such a way, that the oracle is alway right.
maybe surprisingly, the less CPU-bound task the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is CPU-time+original-latency-time/CPU-time.

Compiling and executing leads:

>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops   time no prefetch    time prefetch   factor
...
7   1.0711102260000001  0.230566831 4.6455521002498408
8   1.0511602149999999  0.22651144600000001 4.6406494398521474
9   1.049024333 0.22841439299999999 4.5926367389641687
....

to a speed-up between 4 and 5.

Listing of prefetch_demp.cpp:

//prefetch_demo.cpp

#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>

const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;

unsigned int next(unsigned int current){
   return (current*10001+328)%SIZE;
}


template<bool prefetch>
struct Worker{
   std::vector<int> mem;

   double result;
   int oracle_offset;

   void operator()(){
        unsigned int prefetch_index=0;
        for(int i=0;i<oracle_offset;i++)
            prefetch_index=next(prefetch_index);

        unsigned int index=0;
        for(int i=0;i<STEP_CNT;i++){
            //prefetch memory block used in a future iteration
            if(prefetch){
                __builtin_prefetch(mem.data()+prefetch_index,0,1);    
            }
            //actual work:
            result+=mem[index];

            //prepare next iteration
            prefetch_index=next(prefetch_index);
            index=next(mem[index]);
        }
   }

   Worker(std::vector<int> &mem_):
       mem(mem_), result(0.0), oracle_offset(0)
   {}
};

template <typename Worker>
    double timeit(Worker &worker){
    auto begin = std::chrono::high_resolution_clock::now();
    worker();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}


 int main() {
     //set up the data in special way!
     std::vector<int> keys(SIZE);
     for (int i=0;i<SIZE;i++){
       keys[i] = i;
     }

     Worker<false> without_prefetch(keys);
     Worker<true> with_prefetch(keys);

     std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
     std::cout<<std::setprecision(17);

     for(int i=0;i<20;i++){
         //let oracle see i steps in the future:
         without_prefetch.oracle_offset=i;
         with_prefetch.oracle_offset=i;

         //calculate:
         double time_with_prefetch=timeit(with_prefetch);
         double time_no_prefetch=timeit(without_prefetch);

         std::cout<<i<<"\t"
                  <<time_no_prefetch<<"\t"
                  <<time_with_prefetch<<"\t"
                  <<(time_no_prefetch/time_with_prefetch)<<"\n";
     }

 }

Trying this with MSVC 19.33, I actually seem to get around a ~7.5x speedup, for pretty much all the `oracle_offset` values. Even for `0`, it looks like the compiler unrolls the loop and does 10 iterations at a time together with offsetting the prefetch instructions by 1 (roughly so: `prefetch(a); read(x); prefetch(b); read(a);`). Thanks for the great example. — Jake1234, Nov 03 '22 at 01:40

score 0 · Answer 4 · answered Sep 23 '18 at 21:37

Pre-fetching data can be optimized to the Cache Line size, which for most modern 64-bit processors is 64 bytes to for example pre-load a uint32_t[16] with one instruction.

For example on ArmV8 I discovered through experimentation casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the required instructions required as before I had to increment by 8 as it was only loading half the data, even though my understanding was that it fetches a full cache line.

Pre-fetching an uint32_t[32] original code example...

int addrindex = &B[0];
    __builtin_prefetch(&V[addrindex]);
    __builtin_prefetch(&V[addrindex + 8]);
    __builtin_prefetch(&V[addrindex + 16]);
    __builtin_prefetch(&V[addrindex + 24]);

After...

int addrindex = &B[0];
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);

For some reason int datatype for the address index/offset gave better performance. Tested with GCC 8 on Cortex-a53. Using an equivalent 64 byte vector on other architectures might give the same performance improvement if you find it is not pre-fetching all the data like in my case. In my application with a one million iteration loop, it improved performance by 5% just by doing this. There were further requirements for the improvement.

the 128 megabyte "V" memory allocation had to be aligned to 64 bytes.

uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));

Also, I had to use C operators instead of Neon Intrinsics, since they require regular datatype pointers (in my case it was uint32_t *) otherwise the new built in prefetch method had a performance regression.

My real world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in the scrypt_core() and its internal function which are all easy to read. The hard work is done by GCC8. Overall improvement to performance was 25%.

score 0 · Answer 5 · answered Sep 07 '11 at 01:47

0

From the documentation:

      for (i = 0; i < n; i++)
        {
          a[i] = a[i] + b[i];
          __builtin_prefetch (&a[i+j], 1, 1);
          __builtin_prefetch (&b[i+j], 0, 1);
          /* ... */
        }

answered Sep 07 '11 at 01:47

wallyk

56,922
16
83
148

18

I expect that the CPU's hardware prefetcher, would have prefetched this anyway. This is usually the cause of people discovering that "prefetch does nothing" - it really requires that the access pattern is something that a reasonably simple piece of logic, analyzing access patterns cannot predict. – Crowley9 Sep 08 '11 at 01:48
1

@Crowley9 I disagree that this is a bad answer. The OP wanted a simple example (probably to know how to use it), this answers on that. – a3mlord Mar 11 '15 at 19:21
1

Older CPUs with less smart hardware prefetching benefited from software prefetching in more cases. I think even P4 would have been smart enough to HW prefetch sequential accesses to contiguous data, though. This is a terrible example because it's a case where the extra prefetch instructions just slow things down. @a3mlord: The OP wanted a performance win, not just the syntax. – Peter Cordes Aug 03 '15 at 02:41
This example is too short to answer the question. – rookiepig Jul 13 '16 at 10:13

Prefetching Examples?

5 Answers5

Linked

Related