I learned a lot from the excellent answers provided by @JamesScriven and @Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess somewhat artificial) example, where prefetching has a bigger impact (about factor 4 on my machine).
There are three possible bottle-necks for the modern architectures: CPU-speed, memory-band-width and memory latency. Prefetching is all about reducing the latency of the memory-accesses.
In a perfect scenario, where latency corresponds to X calculation-steps, we would have a oracle, which would tell us which memory we would access in X calculation-steps, the prefetching of this data would be launched and it would arrive just in-time X calculation-steps later.
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.
That is the reason, why there is such a modest improvement for @Mystical's example: The prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much band-width is left - it could be becoming the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from the @JamesScriven example: neither we nor the CPU knows the next access-address before the the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would lead to a look-forward and the hardware would be able to prefetch the data. However, because we can speculate about only one step there is not that much potential. I was not able to get more than 40% on my machine.
So let's rig the competition and prepare the data in such a way that we know which address is accessed in  X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
   return (current*10001+328)%SIZE;
}
//the actual work is happening here
void operator()(){
    //set up the oracle - let see it in the future oracle_offset steps
    unsigned int prefetch_index=0;
    for(int i=0;i<oracle_offset;i++)
        prefetch_index=next(prefetch_index);
    unsigned int index=0;
    for(int i=0;i<STEP_CNT;i++){
        //use oracle and prefetch memory block used in a future iteration
        if(prefetch){
            __builtin_prefetch(mem.data()+prefetch_index,0,1);    
        }
        //actual work, the less the better
        result+=mem[index];
        //prepare next iteration
        prefetch_index=next(prefetch_index);  #update oracle
        index=next(mem[index]);               #dependency on `mem[index]` is VERY important to prevent hardware from predicting future
    }
}
Some remarks:
- data is prepared in such a way, that the oracle is alway right.
- maybe surprisingly, the less CPU-bound task the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is CPU-time+original-latency-time/CPU-time.
Compiling and executing leads:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops   time no prefetch    time prefetch   factor
...
7   1.0711102260000001  0.230566831 4.6455521002498408
8   1.0511602149999999  0.22651144600000001 4.6406494398521474
9   1.049024333 0.22841439299999999 4.5926367389641687
....
to a speed-up between 4 and 5.
Listing of prefetch_demp.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>
const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;
unsigned int next(unsigned int current){
   return (current*10001+328)%SIZE;
}
template<bool prefetch>
struct Worker{
   std::vector<int> mem;
   double result;
   int oracle_offset;
   void operator()(){
        unsigned int prefetch_index=0;
        for(int i=0;i<oracle_offset;i++)
            prefetch_index=next(prefetch_index);
        unsigned int index=0;
        for(int i=0;i<STEP_CNT;i++){
            //prefetch memory block used in a future iteration
            if(prefetch){
                __builtin_prefetch(mem.data()+prefetch_index,0,1);    
            }
            //actual work:
            result+=mem[index];
            //prepare next iteration
            prefetch_index=next(prefetch_index);
            index=next(mem[index]);
        }
   }
   Worker(std::vector<int> &mem_):
       mem(mem_), result(0.0), oracle_offset(0)
   {}
};
template <typename Worker>
    double timeit(Worker &worker){
    auto begin = std::chrono::high_resolution_clock::now();
    worker();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}
 int main() {
     //set up the data in special way!
     std::vector<int> keys(SIZE);
     for (int i=0;i<SIZE;i++){
       keys[i] = i;
     }
     Worker<false> without_prefetch(keys);
     Worker<true> with_prefetch(keys);
     std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
     std::cout<<std::setprecision(17);
     for(int i=0;i<20;i++){
         //let oracle see i steps in the future:
         without_prefetch.oracle_offset=i;
         with_prefetch.oracle_offset=i;
         //calculate:
         double time_with_prefetch=timeit(with_prefetch);
         double time_no_prefetch=timeit(without_prefetch);
         std::cout<<i<<"\t"
                  <<time_no_prefetch<<"\t"
                  <<time_with_prefetch<<"\t"
                  <<(time_no_prefetch/time_with_prefetch)<<"\n";
     }
 }