I ran a single-threaded matrix multiplication on a 4-core Intel CPU (1 thread per core), but the numbers from perf don't make sense.
 Performance counter stats for 'system wide':
    31,728,397,287      cpu-cycles                #    0.462 GHz                    
   131,661,730,104      ref-cycles                # 1916.425 M/sec                  
         68,701.58 msec cpu-clock                 #    4.000 CPUs utilized          
         68,701.90 msec task-clock                #    4.000 CPUs utilized          
    31,728,553,882      cpu/cpu-cycles/           #  461.830 M/sec                  
      17.176244725 seconds time elapsed
I set the CPU frequency to the minimum and monitored it during the run, so all cores were running at 800 MHz. That means 1 cycle takes 1.25 ns. With a total of 31,728,397,287 CPU cycles, the execution time should be 39.66 seconds, but the measured run time is 17.18 seconds.
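To make the arithmetic explicit, here is a small Python sketch of the calculation I am doing. The numbers are copied from the perf output above; the 800 MHz figure is my assumption that every core really stayed at the minimum frequency, and the names are just for illustration.

```python
# Numbers copied from the first `perf stat -a` run above.
CPU_CYCLES = 31_728_397_287
ELAPSED_S = 17.176244725

# Assumption: every core stayed at the minimum frequency of 800 MHz.
ASSUMED_FREQ_HZ = 800e6


def cycles_to_seconds(cycles, freq_hz):
    """Time needed to execute `cycles` cycles on one core clocked at `freq_hz`."""
    return cycles / freq_hz


expected_s = cycles_to_seconds(CPU_CYCLES, ASSUMED_FREQ_HZ)

print(f"expected run time at 800 MHz: {expected_s:.2f} s")
print(f"measured elapsed time:        {ELAPSED_S:.2f} s")
print(f"expected / measured:          {expected_s / ELAPSED_S:.2f}")
```

The expected time comes out more than twice the measured elapsed time, which is the mismatch I am asking about.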
I also don't understand why 0.462 GHz is reported next to cpu-cycles.
More information about the processor:
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping:                        3
CPU MHz:                         800.022
CPU max MHz:                     3900,0000
CPU min MHz:                     800,0000
Any thoughts about that?
UPDATE:
I reran the experiment with root access, this time also counting user-space cycles separately (cycles:u).
# perf stat -a -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 15.4839418610
MM execution Time in seconds: 15.3758427450
 Performance counter stats for 'system wide':
    14,237,521,876      cycles:u                  #    0.230 GHz                    
    17,470,220,108      cycles                    #    0.282 GHz                    
         61,974.41 msec cpu-clock                 #    4.000 CPUs utilized          
      15.494002570 seconds time elapsed
As you can see, the reported frequency is still not 800 MHz. However, if I don't specify -a, the result makes sense, because cycles:u * (1/800 MHz) is nearly the same as the elapsed time.
# perf stat -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 16.5347361100
MM execution Time in seconds: 16.4267430900
 Performance counter stats for './mm_double_omp 1':
    13.135.516.694      cycles:u                  #    0,794 GHz                    
    13.201.778.987      cycles                    #    0,798 GHz                    
         16.541,22 msec cpu-clock                 #    1,000 CPUs utilized          
      16,544487905 seconds time elapsed
      16,522146000 seconds user
       0,019997000 seconds sys
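Looking at the three runs together, the GHz column seems to match the cycle count divided by the cpu-clock time (which is summed over all 4 CPUs with -a) rather than by the elapsed time. This is only my reading of the numbers, not something I have confirmed in the perf documentation. A minimal Python sketch of that check, with the values copied from the outputs above and a hypothetical helper name:

```python
# (label, cycle count, cpu-clock in msec), copied from the perf outputs above.
RUNS = [
    ("first run, -a, cpu-cycles", 31_728_397_287, 68_701.58),
    ("second run, -a, cycles:u", 14_237_521_876, 61_974.41),
    ("second run, -a, cycles", 17_470_220_108, 61_974.41),
    ("third run, no -a, cycles:u", 13_135_516_694, 16_541.22),
    ("third run, no -a, cycles", 13_201_778_987, 16_541.22),
]


def effective_ghz(cycles, cpu_clock_msec):
    """Cycles per nanosecond of cpu-clock time, i.e. an average frequency in GHz.

    Assumption: this is how the GHz column is derived; it is inferred from
    the numbers, not taken from the perf source.
    """
    cpu_clock_s = cpu_clock_msec / 1000.0
    return cycles / cpu_clock_s / 1e9


for label, cycles, cpu_clock_msec in RUNS:
    print(f"{label:28s} {effective_ghz(cycles, cpu_clock_msec):.3f} GHz")
```

If that reading is right, then with -a the cycles are being averaged over about 62 to 69 seconds of cpu-clock time on 4 CPUs, while only one thread is doing the work.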
 
    