I compile my Gaussian blur program with the following two commands (one per makefile):
- g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
- g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
My two testing environments are:
- Intel Core i7-4710HQ (4 cores, 8 threads)
- Intel Xeon E5-2650 v3
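However, compared with the second build, the binary from the first command runs about 2x faster on the E5 but at only about half the speed on the i7. In other words, the second build is the faster one on the i7 and the slower one on the E5.
Can anyone explain this?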
This is the source code: https://github.com/makeapp007/interpolateFloatImg
I will add more details as soon as possible.
On the i7 the program runs with 8 threads. I don't know how many threads it spawns on the E5.
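One way to settle the thread-count question is to ask the OpenMP runtime directly. This is only a minimal sketch, not code from the repository: the helper name `default_thread_count` is hypothetical, and it assumes the program relies on OpenMP's default thread count rather than setting one explicitly.

```cpp
#include <cstdio>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of interpolateFloatImg): returns the number
// of threads a default OpenMP parallel region may use on this machine.
// When compiled without -fopenmp it falls back to the hardware thread count.
int default_thread_count() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    unsigned n = std::thread::hardware_concurrency();  // 0 means "unknown"
    return n > 0 ? static_cast<int>(n) : 1;
#endif
}

// Prints the value so it can be compared between the i7 and the E5.
void report_thread_count() {
    std::printf("default thread count: %d\n", default_thread_count());
}
```

Calling `report_thread_count()` at the start of `main` on both machines would show the difference. Setting the `OMP_NUM_THREADS` environment variable before running the program overrides this default, which makes it possible to test both machines with the same thread count.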
==== Update ====
I am the teammate of the original author on this project, and here are the results.
Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 211199093
Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
1423026.281358      task-clock:u (msec)       #    6.516 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
         2,604      page-faults:u             #    0.002 K/sec                  
4,167,572,543,807      cycles:u                  #    2.929 GHz                      (46.79%)
6,713,517,640,459      instructions:u            #    1.61  insn per cycle           (59.29%)
725,873,982,404      branches:u                #  510.092 M/sec                    (57.28%)
23,468,237,735      branch-misses:u           #    3.23% of all branches          (56.99%)
544,480,682,764      L1-dcache-loads:u         #  382.622 M/sec                    (37.00%)
545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits    (31.44%)
38,696,703,292      LLC-loads:u               #   27.193 M/sec                    (26.68%)
1,204,703,652      LLC-load-misses:u         #    3.11% of all LL-cache hits     (35.70%)
218.384387536 seconds time elapsed
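As a sanity check on the printed kernel values, the sketch below evaluates a 2D Gaussian at the centre and at the corner of a 255 x 255 kernel with a standard deviation of 20. It assumes the kernel is G(x, y) = exp(-(x² + y²) / (2σ²)) / (2πσ²); this formula is a guess that happens to reproduce the printed maximum and minimum, not something taken from the source code, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstdio>

// ASSUMED kernel formula: G(dx, dy) = exp(-(dx^2 + dy^2) / (2 sigma^2)) / (2 pi sigma^2).
// (dx, dy) is the offset from the kernel centre.
double gaussian2d(double dx, double dy, double sigma) {
    const double pi = std::acos(-1.0);
    const double two_sigma_sq = 2.0 * sigma * sigma;
    return std::exp(-(dx * dx + dy * dy) / two_sigma_sq) / (pi * two_sigma_sq);
}

// The largest value sits at the centre of the kernel.
double kernel_max(double sigma) {
    return gaussian2d(0.0, 0.0, sigma);
}

// The smallest value sits at a corner, kernel_size / 2 steps from the centre
// along both axes (127 for a kernel size of 255).
double kernel_min(int kernel_size, double sigma) {
    const double r = kernel_size / 2;  // integer division on purpose
    return gaussian2d(r, r, sigma);
}

// Prints both extremes in the same style as the program's log lines.
void print_kernel_extremes(int kernel_size, double sigma) {
    std::printf("Kernel maximum: %g\n", kernel_max(sigma));
    std::printf("Kernel minimum: %g\n", kernel_min(kernel_size, sigma));
}
```

With these assumptions, `print_kernel_extremes(255, 20.0)` should agree with the log to the printed precision. The minimum is about 17 orders of magnitude below the maximum, so the outer taps of the kernel contribute almost nothing to the result.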
And these are the results from the workstation:
workstation:~/mossCAP3/repos/liuyh1_liujzh/12$  perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 133661220
Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
2035379.528531      task-clock (msec)         #   14.485 CPUs utilized          
         7,370      context-switches          #    0.004 K/sec                  
           273      cpu-migrations            #    0.000 K/sec                  
         3,123      page-faults               #    0.002 K/sec                  
5,272,393,071,699      cycles                    #    2.590 GHz                     [49.99%]
             0      stalled-cycles-frontend   #    0.00% frontend cycles idle   
             0      stalled-cycles-backend    #    0.00% backend  cycles idle   
7,425,570,600,025      instructions              #    1.41  insns per cycle         [62.50%]
370,199,835,630      branches                  #  181.882 M/sec                   [62.50%]
47,444,417,555      branch-misses             #   12.82% of all branches         [62.50%]
591,137,049,749      L1-dcache-loads           #  290.431 M/sec                   [62.51%]
545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits   [62.51%]
38,725,975,976      LLC-loads                 #   19.026 M/sec                   [50.00%]
 1,093,840,555      LLC-load-misses           #    2.82% of all LL-cache hits    [49.99%]
140.520016141 seconds time elapsed
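To compare the two runs more directly, the derived figures can be recomputed from the raw counters above. The sketch below copies the numbers from the two `perf stat` outputs; the struct and function names are made up for illustration. Since perf multiplexed the counters (the percentages at the end of each line), the counts are estimates and the ratios should be read as approximate.

```cpp
#include <cstdio>

// Raw values copied from one `perf stat -d` run.
struct PerfRun {
    double wall_seconds;     // "seconds time elapsed"
    double task_clock_msec;  // "task-clock"
    double cycles;
    double instructions;
};

// Numbers taken verbatim from the two outputs above.
const PerfRun kLaptop{218.384387536, 1423026.281358,
                      4167572543807.0, 6713517640459.0};
const PerfRun kWorkstation{140.520016141, 2035379.528531,
                           5272393071699.0, 7425570600025.0};

// Instructions retired per cycle.
double ipc(const PerfRun& r) { return r.instructions / r.cycles; }

// Average number of busy CPUs: CPU time divided by wall-clock time.
double cpus_utilized(const PerfRun& r) {
    return r.task_clock_msec / 1000.0 / r.wall_seconds;
}

// How many times faster `fast` finished than `slow` in wall-clock time.
double speedup(const PerfRun& slow, const PerfRun& fast) {
    return slow.wall_seconds / fast.wall_seconds;
}

void print_summary() {
    std::printf("laptop:      IPC %.2f, CPUs utilized %.3f\n",
                ipc(kLaptop), cpus_utilized(kLaptop));
    std::printf("workstation: IPC %.2f, CPUs utilized %.3f\n",
                ipc(kWorkstation), cpus_utilized(kWorkstation));
    std::printf("wall-clock speedup (workstation vs laptop): %.2fx\n",
                speedup(kLaptop, kWorkstation));
}
```

The workstation keeps roughly 14.5 CPUs busy against about 6.5 on the laptop, yet finishes only around 1.55x sooner, so it spends more total CPU time on the same image.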
==== Update ====
The specification of the E5:
workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
     20  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
workstation:~$ dmesg | grep cache
[    0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[    0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[    0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    0.558666] PCI: pci_cache_line_size set to 64 bytes
[    0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[    1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[    1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[    1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 
     
    