I tested memcpy() behavior on my system after reading the question "Why does the speed of memcpy() drop dramatically every 4KB?"
Details of my system:
arun@arun-OptiPlex-9010:~/mem_copy_test$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               1600.000
BogoMIPS:              6784.45
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
arun@arun-OptiPlex-9010:~/mem_copy_test$ cat /proc/cpuinfo | grep 'model name'| head -1
model name  : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
arun@arun-OptiPlex-9010:~/mem_copy_test$ uname -a
Linux arun-OptiPlex-9010 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Test program:
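#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Copy buf_size bytes iters times and print the rate in MiB/s.
 * Note: neither buffer is written before the timed loop starts. */
void memcpy_speed(unsigned long buf_size, unsigned long iters)
{
    struct timeval start, end;
    unsigned char *pbuff_1;
    unsigned char *pbuff_2;
    unsigned long i;   /* same type as iters, avoids a signed/unsigned comparison */

    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);
    if (pbuff_1 == NULL || pbuff_2 == NULL) {
        fprintf(stderr, "malloc failed\n");
        free(pbuff_1);
        free(pbuff_2);
        return;
    }

    gettimeofday(&start, NULL);
    for (i = 0; i < iters; ++i) {
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);

    /* Bytes per microsecond equals 10^6 bytes/s; dividing by 1.024*1.024 gives MiB/s. */
    printf("%5.3f\n", ((buf_size * iters) / (1.024 * 1.024)) /
           ((end.tv_sec - start.tv_sec) * 1000 * 1000 +
            (end.tv_usec - start.tv_usec)));

    free(pbuff_1);
    free(pbuff_2);
}

int main(void)
{
    unsigned long buf_size;
    unsigned int i;

    /* Buffer sizes from 1 KB to 16384 KB in 1 KB steps. */
    for (i = 1; i < 16385; i++) {
        printf("bufsize in kb=%u speed=", i);
        buf_size = i * 1024UL;
        memcpy_speed(buf_size, 10000);
        printf("\n");
    }
    return 0;
}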
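One variation I could try, to check whether first-touch page faults or clock adjustments influence the numbers, is sketched below. It writes to both buffers before timing and uses clock_gettime(CLOCK_MONOTONIC) instead of gettimeofday(). The function name memcpy_speed_touched and the fill values are my own choices for illustration and are not part of the test above, and I have not confirmed that this changes the shape of the graphs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical variant of memcpy_speed(): returns the copy rate in MiB/s,
 * or -1.0 if an allocation fails. */
double memcpy_speed_touched(size_t buf_size, unsigned long iters)
{
    struct timespec start, end;
    unsigned char *src, *dst;
    volatile unsigned char sink;
    unsigned long i;
    double secs;

    assert(buf_size > 0 && iters > 0);

    src = malloc(buf_size);
    dst = malloc(buf_size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Write to both buffers so every page is mapped before timing starts. */
    memset(src, 0xA5, buf_size);
    memset(dst, 0x5A, buf_size);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iters; ++i) {
        memcpy(dst, src, buf_size);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Read one copied byte so the copies are not trivially dead stores. */
    sink = dst[buf_size - 1];
    (void)sink;

    secs = (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;

    free(src);
    free(dst);

    /* bytes / 2^20 / seconds = MiB/s */
    return ((double)buf_size * (double)iters) / (1024.0 * 1024.0) / secs;
}
```

In main() the call would become printf("%5.3f\n", memcpy_speed_touched(buf_size, 10000)); the rest of the loop stays the same.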
I am sharing the output through Google Drive because Stack Overflow does not allow me to post images (it says 10 reputation is needed for that).
Output for 1 to 256 KB: https://drive.google.com/file/d/0B3mnbsS6F4tpY2dhRWJLaEY1RWc/view?usp=sharing
Output for 1 to 16384 KB: https://drive.google.com/file/d/0B3mnbsS6F4tpeC1Dd2R1VnJOV2c/view?usp=sharing
1) Why does the graph have a peak at 11-13 KB?
2) Why is the behavior different between 20 to 129 KB (range 1) and 130 to 256 KB (range 2)? Range 1 has its maximum speed at sizes that are not multiples of 4 KB, while range 2 has its maximum speed at multiples of 4 KB, with large peaks. Range 2 is also faster than range 1 at multiples of 4 KB.
3) Why does the speed drop dramatically close to 3000 KB?
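As a rough aid for reading the graphs, the sketch below compares the working set of one copy (source plus destination, so twice the buffer size) with the cache sizes from the lscpu output above. The helper name working_set_level and the enum are hypothetical, and the comparison is only a first approximation: it ignores associativity, other data competing for the caches, and how L3 is shared between cores. I am not claiming it explains the peaks.

```c
/* Cache sizes in KB, taken from the lscpu output above. */
#define L1D_KB 32UL
#define L2_KB  256UL
#define L3_KB  8192UL

enum cache_level { FITS_L1D, FITS_L2, FITS_L3, EXCEEDS_L3 };

/* Hypothetical helper: smallest cache level that could hold both buffers
 * of a copy with the given buffer size in KB. */
enum cache_level working_set_level(unsigned long buf_kb)
{
    unsigned long working_set_kb = 2 * buf_kb;  /* source + destination */

    if (working_set_kb <= L1D_KB)
        return FITS_L1D;
    if (working_set_kb <= L2_KB)
        return FITS_L2;
    if (working_set_kb <= L3_KB)
        return FITS_L3;
    return EXCEEDS_L3;
}
```

Under this simple model the boundaries fall at buffer sizes of 16 KB, 128 KB and 4096 KB, which can be compared against where the graphs change.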
--Arun