I don't understand why peephole optimization is needed? Because the compiler is smart enough to optimise the code? Can you please give me some examples where peephole optimization is needed?
1 Answers
Peepholes are often target-specific.
They may only make sense in terms of target registers (RTL), not IR.
For example e.g. x86 xor eax, eax instead of mov eax,0. (What is the best way to set a register to zero in x86 assembly: xor, mov or and?).  There'd be no reason to do this in IR and doing it any earlier than the last moment (final code-generation) would obfuscate the fact that the value is zero for other optimizations.  Doing that for any machine except x86 would be an anti-optimization (creating a false dependency).  OTOH you don't want to leave it too late, or else you might not be able to reorder it ahead of something that sets FLAGS, e.g.
  xor  eax,eax
  cmp  ecx, edx
  sete al           ; boolean 0 or 1  zero-extended to 64-bit RAX
Instead of
  cmp   ecx, edx
  sete  al               ; false dependency on old RAX
  movzx eax, al          ; no mov-elimination, extra critical path latency
or
  cmp   ecx, edx
  mov   eax, 0          ; less efficient instruction to leave FLAGS untouched
  sete  al              ; later reads of RAX will have partial-register stalls on P6-family
Or as another example, x86 can multiply by 3, 5, or 9 using LEA to take advantage of the 2-bit shift and add in 2-register addressing-modes.  It might be useful for an optimizer to know that this is an efficient building-block, and aim to re-factor things into a multiply by 9, but actually converting a multiply by 10 into (x * 5) * 2 is not how you'd want to do it for targets where (x<<3) + (x<<1) is more efficient (x*10 = x*8 + x*2).
See
- Using LEA on values that aren't addresses / pointers?
- How to multiply a register by 37 using only 2 consecutive leal instructions in x86?  - shows how some compiles sometimes miss a peephole optimization, and discusses the tradeoff of imulvs. 2xleaand how modern CPUs with fastimulmake it only worth it to spend at most 2 instructions replacing a multiply, or only 1 if the bottleneck is throughput not latency. Unless you can fold an addition into it like LEA can...
 
    
    - 328,167
- 45
- 605
- 847
