Let's consider the following function:
#include <stdint.h>
uint64_t foo(uint64_t x) { return x * 3; }
If I were to write it by hand in assembly, I'd do
.global foo
.text
foo:
    imul $0x3, %rdi, %rax
    ret
The compiler, on the other hand, generates two additions at -O0:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
   8:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
   c:   48 89 d0                mov    %rdx,%rax
   f:   48 01 c0                add    %rax,%rax
  12:   48 01 d0                add    %rdx,%rax
  15:   5d                      pop    %rbp
  16:   c3                      retq   
or lea with -O2:
0000000000000000 <foo>:
   0:   48 8d 04 7f             lea    (%rdi,%rdi,2),%rax
   4:   c3                      retq   
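To be sure all three variants really compute the same value, here is a rough C sketch of the three code shapes. The function names (triple_mul, triple_add, triple_lea, check_triple) are made up for illustration and do not come from the compiler. An optimizing compiler may well turn all three into the same instruction, so this only checks the arithmetic, not the timing.

```c
#include <assert.h>
#include <stdint.h>

/* Shape of the hand-written version: one multiply by the constant 3. */
uint64_t triple_mul(uint64_t x) { return x * 3; }

/* Shape of the -O0 output: copy x, double the copy, then add x once more. */
uint64_t triple_add(uint64_t x) {
    uint64_t r = x;  /* mov %rdx,%rax */
    r += r;          /* add %rax,%rax */
    r += x;          /* add %rdx,%rax */
    return r;
}

/* Shape of the -O2 output: lea (%rdi,%rdi,2) computes base + index * scale. */
uint64_t triple_lea(uint64_t x) { return x + x * 2; }

/* Hypothetical helper: asserts that the three shapes agree for a given x. */
void check_triple(uint64_t x) {
    assert(triple_mul(x) == triple_add(x));
    assert(triple_mul(x) == triple_lea(x));
}
```

Since uint64_t arithmetic wraps modulo 2^64, the three functions agree for every input, including values where x * 3 overflows.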
Why? My assumption is that every assembly instruction takes one processor clock cycle, so my version should run in 2 cycles (it has two instructions). By the same count, the -O0 version needs 4 cycles to perform the addition, because it could be rewritten as
  mov    %rdi,%rax
  add    %rax,%rax
  add    %rdi,%rax
  retq
and the lea version should take two cycles as well.
 
    