I read in an online resource that Ivy Bridge has 3 ALUs, so I wrote a small program to test this:
global _start
_start:
    mov rcx,    10000000
.for_loop:              ; do {
    inc rax
    inc rbx
    dec rcx
    jnz .for_loop       ; } while (--rcx)
    xor rdi,    rdi
    mov rax,    60      ; _exit(0)
    syscall
I assembled, linked, and ran it under perf:
$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out
The output shows:
10,491,664      cycles
which seems to make sense at first glance: the loop contains 3 independent instructions (2 inc and 1 dec) that each use an ALU, so together they take 1 cycle per iteration.
But what I don't understand is why the whole loop takes only 1 cycle per iteration. jnz depends on the result of dec rcx, so it should take 1 more cycle, making the whole loop 2 cycles per iteration. I would expect the output to be close to 20,000,000 cycles.
I also tried changing the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does become close to 20,000,000 cycles, which shows that a dependency delays an instruction so that the two can't execute in the same cycle. So why is jnz special?
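For reference, the dependent variant described above would look roughly like this. It is a sketch reconstructed from the description: only the second inc changes, and everything else is assumed to be identical to the first listing.

```nasm
global _start
_start:
    mov rcx,    10000000
.for_loop:              ; do {
    inc rax
    inc rax             ; was inc rbx; now depends on the previous inc rax
    dec rcx
    jnz .for_loop       ; } while (--rcx)
    xor rdi,    rdi
    mov rax,    60      ; _exit(0)
    syscall
```

It can be built and measured with the same nasm, ld, and perf stat command as the original program.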
What am I missing here?