I have some simple C code that does this (pseudocode):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 100000000   /* buffer size in bytes, roughly 95 MB */

int *DataSrc  = (int *) malloc(N);
int *DataDest = (int *) malloc(N);
memset(DataSrc, 0, N);            /* source is zeroed; destination is never written before the copies */

for (int i = 0; i < 4; i++) {
    StartTimer();                 /* placeholder timing calls */
    memcpy(DataDest, DataSrc, N);
    StopTimer();
}

printf("%d\n", DataDest[RandomInteger]);   /* read a value so the compiler can't drop the copies */
My PC: Intel Core i7-3930, with 4x4GB of DDR3-1600 memory, running Red Hat 6.1 64-bit.
The first memcpy() runs at 1.9 GB/s, while the next three run at 6.2 GB/s.
The buffer size (N) is too big for this to be caused by cache effects. So, my first question:
- Why is the first memcpy() so much slower? Maybe malloc() doesn't fully allocate the memory until you use it?
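One way I figure I could test the allocation hypothesis (just a sketch, not something I have measured): touch one byte per page of both buffers before the timed loop, so any page-fault cost is paid up front. The 4096-byte page size is an assumption for x86-64 Linux. If the first-iteration slowdown disappears, it was page faults rather than memcpy itself:

/* Sketch: touch one byte per page so the kernel backs every page before timing. */
static void prefault(char *buf, size_t len)
{
    const size_t page = 4096;             /* assumed x86-64 page size */
    for (size_t i = 0; i < len; i += page)
        buf[i] = 0;                       /* first write to each page triggers the fault now, not during memcpy */
}

/* before the timed loop: */
prefault((char *) DataSrc,  N);
prefault((char *) DataDest, N);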
If I eliminate the memset(), then the first memcpy() runs at about 1.5 GB/s,
but the next three run at 11.8 GB/s. Almost a 2x speedup. My second question:
- Why is memcpy() 2x faster if I don't call memset()?
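A variant I am considering for the second question (again only a sketch, not measured): memset() both buffers, not just the source, before the timed loop. If the steady-state numbers for "memset source only" and "memset both" come out the same, the difference would seem to be about how the kernel backs pages that were never written, rather than about memcpy itself:

/* Variant of the loop above: zero BOTH buffers so every page of source and
   destination is backed by real memory before any timing starts. */
memset(DataSrc,  0, N);
memset(DataDest, 0, N);

for (int i = 0; i < 4; i++) {
    StartTimer();
    memcpy(DataDest, DataSrc, N);
    StopTimer();
}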