I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop:
BITS 64
GLOBAL _start
SECTION .text
_start:
    mov ecx, 10000000
.loop:
    dec ecx             ;|
    jz .end             ;| 1 uOP (call it D)
    jmp .loop           ;| 1 uOP (call it J)
.end:
    mov eax, 60
    xor edi, edi
    syscall
Using perf, we see that the loop runs at 1 c/iter:
Performance counter stats for './main' (50 runs):
        10,001,055      uops_executed_port_port_6   ( +-  0.00% )
         9,999,973      uops_executed_port_port_0   ( +-  0.00% )
        10,015,414      cycles:u                    ( +-  0.02% )
                23      resource_stalls_rs          ( +- 64.05% )
My interpretations of these results are:
- Both D and J are dispatched in parallel.
- J has a reciprocal throughput of 1 cycle.
- Both D and J are dispatched optimally.
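As a sanity check, the rates behind these interpretations can be recomputed from the counters. This is only a sketch: the counter values are copied from the perf output above, the iteration count is the initial value of `ecx`, and the variable names are made up for illustration.

```python
# Counter values copied from the perf output above (averages over 50 runs).
ITERATIONS = 10_000_000      # initial value of ecx in the loop
cycles = 10_015_414          # cycles:u
port0_uops = 9_999_973       # uops_executed_port_port_0
port6_uops = 10_001_055      # uops_executed_port_port_6

# Cycles spent per loop iteration.
cycles_per_iter = cycles / ITERATIONS

# Combined dispatch rate of the two ports with a branch unit.
uops_per_cycle = (port0_uops + port6_uops) / cycles

print(f"{cycles_per_iter:.4f} c/iter")
print(f"{uops_per_cycle:.3f} uOPs/c on ports 0 and 6")
```

Both values come out very close to 1 c/iter and 2 uOPs/c, which is what the three points above rely on.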
However, we can also see that the RS never gets full.
The RS can dispatch at most 2 uOPs/c here, but it could theoretically receive 4 uOPs/c from the front end, so it should become full in about 30 c (for an RS with a size of 60 fused-domain entries).  
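The 30 c estimate is just the RS size divided by the net fill rate. A minimal sketch of that arithmetic, assuming the 60-entry RS and the 4 and 2 uOPs/c rates stated above (the names are illustrative):

```python
RS_ENTRIES = 60      # fused-domain RS size assumed in the text
ISSUE_RATE = 4       # uOPs/c the front end could theoretically deliver
DISPATCH_RATE = 2    # uOPs/c the loop can dispatch (one D and one J)

# Entries gained per cycle when issue outpaces dispatch.
net_fill_rate = ISSUE_RATE - DISPATCH_RATE

# Cycles needed to go from an empty RS to a full one.
cycles_to_fill = RS_ENTRIES / net_fill_rate

print(cycles_to_fill)
```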
To my understanding, there should be very few branch mispredictions and the uOPs should all come from the LSD.
So I looked at the FE:
     8,239,091      lsd_cycles_active ( +-  3.10% )
       989,320      idq_dsb_cycles    ( +- 23.47% )
     2,534,972      idq_mite_cycles   ( +- 15.43% )
         4,929      idq_ms_uops       ( +-  8.30% )
   0.007429733 seconds time elapsed   ( +-  1.79% )
which confirms that the FE is issuing from the LSD [1].
However, the LSD never issues 4 uOPs/c:
     7,591,866      lsd_cycles_active ( +-  3.17% )
             0      lsd_cycles_4_uops 
My interpretation is that the LSD cannot issue uOPs from the next iteration [2], thereby only sending D J pairs to the BE each cycle.
Is my interpretation correct? 
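To make the two scenarios concrete, here is a toy cycle-by-cycle model of RS occupancy. It is a deliberate simplification, not a description of the real scheduler: it assumes a fixed issue rate, a fixed dispatch rate of 2 uOPs/c, a 60-entry RS, and no other stalls. The function name and parameters are hypothetical.

```python
def cycles_until_full(issue_rate, dispatch_rate=2, capacity=60, max_cycles=1000):
    """Return the first cycle in which the RS is full, or None if it never fills.

    Toy model: each cycle the front end issues up to `issue_rate` uOPs into
    free entries, then the ports drain up to `dispatch_rate` uOPs.
    """
    occupancy = 0
    for cycle in range(1, max_cycles + 1):
        # The front end can only issue into free RS entries.
        occupancy += min(issue_rate, capacity - occupancy)
        if occupancy == capacity:
            return cycle
        # Ports 0 and 6 together dispatch at most `dispatch_rate` uOPs.
        occupancy -= min(dispatch_rate, occupancy)
    return None

# Front end issuing 4 uOPs/c versus issuing only one D J pair per cycle.
print(cycles_until_full(issue_rate=4))
print(cycles_until_full(issue_rate=2))
```

In this model an issue rate of 4 uOPs/c fills the RS within roughly 30 cycles, while an issue rate of 2 uOPs/c never fills it. The second case matches the near-zero `resource_stalls_rs` count and the zero `lsd_cycles_4_uops` count.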
The source code is in this repository.
[1] There is a bit of variance; I think this is due to the high number of iterations, which allows for some context switches.
[2] This sounds quite complex to do in hardware with limited circuit depth.
 
    