When running toplev, from pmu-tools on a piece of software (compiled with gcc: gcc -g -O3) I get this output:
FE             Frontend_Bound:                                          37.21 +-     0.00 % Slots                
BAD            Bad_Speculation:                                         23.62 +-     0.00 % Slots                
BE             Backend_Bound:                                            7.33 +-     0.00 % Slots below          
RET            Retiring:                                                31.82 +-     0.00 % Slots below          
FE             Frontend_Bound.Frontend_Latency:                         26.55 +-     0.00 % Slots                
FE             Frontend_Bound.Frontend_Bandwidth:                       10.62 +-     0.00 % Slots                
BAD            Bad_Speculation.Branch_Mispredicts:                      23.72 +-     0.00 % Slots                
BAD            Bad_Speculation.Machine_Clears:                           0.01 +-     0.00 % Slots below          
BE/Mem         Backend_Bound.Memory_Bound:                               1.59 +-     0.00 % Slots below          
BE/Core        Backend_Bound.Core_Bound:                                 5.73 +-     0.00 % Slots below          
RET            Retiring.Base:                                           31.54 +-     0.00 % Slots below          
RET            Retiring.Microcode_Sequencer:                             0.28 +-     0.00 % Slots below          
FE             Frontend_Bound.Frontend_Latency.ICache_Misses:            0.70 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.ITLB_Misses:              0.62 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:          5.04 +-     0.00 % Clocks_Estimated      <==
FE             Frontend_Bound.Frontend_Latency.DSB_Switches:             0.57 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.LCP:                      0.00 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Latency.MS_Switches:              0.76 +-     0.00 % Clocks below         
FE             Frontend_Bound.Frontend_Bandwidth.MITE:                   0.36 +-     0.00 % CoreClocks below     
FE             Frontend_Bound.Frontend_Bandwidth.DSB:                   26.79 +-     0.00 % CoreClocks below     
FE             Frontend_Bound.Frontend_Bandwidth.LSD:                    0.00 +-     0.00 % CoreClocks below     
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound:                      6.53 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.L2_Bound:                     -0.03 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound:                      0.37 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound:                    2.46 +-     0.00 % Stalls below         
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound:                   0.22 +-     0.00 % Stalls below         
BE/Core        Backend_Bound.Core_Bound.Divider:                         0.01 +-     0.00 % Clocks below         
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization:              28.53 +-     0.00 % Clocks below         
RET            Retiring.Base.FP_Arith:                                   0.02 +-     0.00 % Uops below           
RET            Retiring.Base.Other:                                     99.98 +-     0.00 % Uops below           
RET            Retiring.Microcode_Sequencer.Assists:                     0.00 +-     0.00 % Slots_Estimated below
               MUX:                                                    100.00 +-     0.00 %                      
warning: 6 results not referenced: 67 71 72 85 87 88
This binary takes around 4.7 seconds to run.
If I add the following flag to gcc: -falign-loops=32, the binary takes now around 3.8 seconds to run, and this is the output from toplev:
FE             Frontend_Bound:                                          17.47 +-     0.00 % Slots below           
BAD            Bad_Speculation:                                         28.55 +-     0.00 % Slots                 
BE             Backend_Bound:                                           12.02 +-     0.00 % Slots                 
RET            Retiring:                                                34.21 +-     0.00 % Slots below           
FE             Frontend_Bound.Frontend_Latency:                          6.10 +-     0.00 % Slots below           
FE             Frontend_Bound.Frontend_Bandwidth:                       11.31 +-     0.00 % Slots below           
BAD            Bad_Speculation.Branch_Mispredicts:                      29.19 +-     0.00 % Slots                  <==
BAD            Bad_Speculation.Machine_Clears:                           0.01 +-     0.00 % Slots below           
BE/Mem         Backend_Bound.Memory_Bound:                               4.58 +-     0.00 % Slots below           
BE/Core        Backend_Bound.Core_Bound:                                 7.44 +-     0.00 % Slots below           
RET            Retiring.Base:                                           33.70 +-     0.00 % Slots below           
RET            Retiring.Microcode_Sequencer:                             0.50 +-     0.00 % Slots below           
FE             Frontend_Bound.Frontend_Latency.ICache_Misses:            0.55 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.ITLB_Misses:              0.58 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:          5.72 +-     0.00 % Clocks_Estimated below
FE             Frontend_Bound.Frontend_Latency.DSB_Switches:             0.17 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.LCP:                      0.00 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Latency.MS_Switches:              0.40 +-     0.00 % Clocks below          
FE             Frontend_Bound.Frontend_Bandwidth.MITE:                   0.68 +-     0.00 % CoreClocks below      
FE             Frontend_Bound.Frontend_Bandwidth.DSB:                   42.01 +-     0.00 % CoreClocks below      
FE             Frontend_Bound.Frontend_Bandwidth.LSD:                    0.00 +-     0.00 % CoreClocks below      
BE/Mem         Backend_Bound.Memory_Bound.L1_Bound:                      7.60 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.L2_Bound:                     -0.04 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.L3_Bound:                      0.70 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.DRAM_Bound:                    0.71 +-     0.00 % Stalls below          
BE/Mem         Backend_Bound.Memory_Bound.Store_Bound:                   1.85 +-     0.00 % Stalls below          
BE/Core        Backend_Bound.Core_Bound.Divider:                         0.02 +-     0.00 % Clocks below          
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization:              17.38 +-     0.00 % Clocks below          
RET            Retiring.Base.FP_Arith:                                   0.02 +-     0.00 % Uops below            
RET            Retiring.Base.Other:                                     99.98 +-     0.00 % Uops below            
RET            Retiring.Microcode_Sequencer.Assists:                     0.00 +-     0.00 % Slots_Estimated below 
               MUX:                                                    100.00 +-     0.00 %                       
warning: 6 results not referenced: 67 71 72 85 87 88
By adding that flag, the Frontend Latency has improved (as we can see from the toplev output). I understand that by adding that flag, now the loops are aligned to 32 bytes, and the DSB is hit more frequently when running tight loops (the code spends its time mostly in a couple of small loops). However I don't understand why the metric Frontend_Bound.Frontend_Bandwidth.DSB has gone up (the description for that metric is: "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline"). I would have expected that metric to go down, as the use of the DSB is precisely what I'm improving by adding the gcc flag.
PS: when running toplev I've used --no-multiplex, to minimize errors caused by multiplexing. The target architecture is Broadwell, and the assembly of the loops is the following (Intel syntax):
 606:   eb 15                   jmp    61d <main+0x7d>
 608:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
 60f:   00 
 610:   48 83 c6 01             add    rsi,0x1
 614:   48 81 fe 01 20 00 00    cmp    rsi,0x2001
 61b:   74 ad                   je     5ca <main+0x2a>
 61d:   41 80 3c 30 00          cmp    BYTE PTR [r8+rsi*1],0x0
 622:   74 ec                   je     610 <main+0x70>
 624:   48 8d 0c 36             lea    rcx,[rsi+rsi*1]
 628:   48 81 f9 00 20 00 00    cmp    rcx,0x2000
 62f:   77 20                   ja     651 <main+0xb1>
 631:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
 636:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
 63d:   00 00 00 
 640:   41 c6 04 08 00          mov    BYTE PTR [r8+rcx*1],0x0
 645:   48 01 f1                add    rcx,rsi
 648:   48 81 f9 00 20 00 00    cmp    rcx,0x2000
 64f:   7e ef                   jle    640 <main+0xa0>
 
    