I try to measure cached / non cached memory access time and results confusing me.
Here is the code:
  1 #include <stdio.h>                                                              
  2 #include <x86intrin.h>                                                          
  3 #include <stdint.h>                                                             
  4                                                                                 
  5 #define SIZE 32*1024                                                            
  6                                                                                 
  7 char arr[SIZE];                                                                 
  8                                                                                 
  9 int main()                                                                      
 10 {                                                                               
 11     char *addr;                                                                 
 12     unsigned int dummy;                                                         
 13     uint64_t tsc1, tsc2;                                                        
 14     unsigned i;                                                                 
 15     volatile char val;                                                          
 16                                                                                 
 17     memset(arr, 0x0, SIZE);                                                     
 18     for (addr = arr; addr < arr + SIZE; addr += 64) {                           
 19         _mm_clflush((void *) addr);                                             
 20     }                                                                           
 21     asm volatile("sfence\n\t"                                                   
 22             :                                                                   
 23             :                                                                   
 24             : "memory");                                                        
 25                                                                                 
 26     tsc1 = __rdtscp(&dummy);                                                    
 27     for (i = 0; i < SIZE; i++) {                                                
 28         asm volatile (                                                          
 29                 "mov %0, %%al\n\t"  // load data                                
 30                 :                                                               
 31                 : "m" (arr[i])                                                  
 32                 );                                                              
 33                                                                                 
 34     }                                                                           
 35     tsc2 = __rdtscp(&dummy);                                                    
 36     printf("(1) tsc: %llu\n", tsc2 - tsc1);                                     
 37                                                                                 
 38     tsc1 = __rdtscp(&dummy);                                                    
 39     for (i = 0; i < SIZE; i++) {                                                
 40         asm volatile (                                                          
 41                 "mov %0, %%al\n\t"  // load data                                
 42                 :                                                               
 43                 : "m" (arr[i])                                                  
 44                 );                                                              
 45                                                                                 
 46     }                                                                           
 47     tsc2 = __rdtscp(&dummy);                                                    
 48     printf("(2) tsc: %llu\n", tsc2 - tsc1);                                     
 49                                                                                 
 50     return 0;                                                                   
 51 } 
the output:
(1) tsc: 451248
(2) tsc: 449568
I expected, that first value would be much larger because caches were invalidated by clflush in case (1).
Info about my cpu (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz) caches:
Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true
Code disassembly between two rdtscp instructions
  400614:       0f 01 f9                rdtscp 
  400617:       89 ce                   mov    %ecx,%esi
  400619:       48 8b 4d d8             mov    -0x28(%rbp),%rcx
  40061d:       89 31                   mov    %esi,(%rcx)
  40061f:       48 c1 e2 20             shl    $0x20,%rdx
  400623:       48 09 d0                or     %rdx,%rax
  400626:       48 89 45 c0             mov    %rax,-0x40(%rbp)
  40062a:       c7 45 b4 00 00 00 00    movl   $0x0,-0x4c(%rbp)
  400631:       eb 0d                   jmp    400640 <main+0x8a>
  400633:       8b 45 b4                mov    -0x4c(%rbp),%eax
  400636:       8a 80 80 10 60 00       mov    0x601080(%rax),%al
  40063c:       83 45 b4 01             addl   $0x1,-0x4c(%rbp)
  400640:       81 7d b4 ff 7f 00 00    cmpl   $0x7fff,-0x4c(%rbp)
  400647:       76 ea                   jbe    400633 <main+0x7d>
  400649:       48 8d 45 b0             lea    -0x50(%rbp),%rax
  40064d:       48 89 45 e0             mov    %rax,-0x20(%rbp)
  400651:       0f 01 f9                rdtscp
Looks like I'am missing / misunderstand something. Could you suggest?
 
    