The way PLT usage is specified in the System V ABI (and implemented in practice) is schematically something like this:
# A call from somewhere in the code goes into a PLT slot
# (In reality the target is not encoded as an absolute address; on x86-64 it is typically rip-relative)
0x500:   
          call 0x1000   
...
0x1000:
   .PLT1: jmp [0x2000]  # the slot for f in the binary's GOT
          pushq $index_f
          jmp .PLT0
...
0x2000: 
# initially points back into the PLT slot (at the pushq, 6 bytes in), so the lazy-binding routine gets called:
   .GOT1: 0x1006
# but after the lazy-binding routine has run:
          0x3000   # the address of the real implementation of f
...
0x3000:
     f:  ....
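To make the control flow above concrete, here is a toy model in Python. It is only a sketch: every name in it (`got`, `plt0`, `plt1`, `plt1_tail`, `SYMBOLS`, `resolver_calls`) is hypothetical and stands in for a piece of the schematic, not for any real loader API. Function objects play the role of addresses, and a dict plays the role of the GOT.

```python
# Toy model of lazy binding through a PLT slot.
# All names are hypothetical stand-ins for the schematic, not a real loader API.

def real_f(x):
    """Stands in for the real implementation of f (0x3000)."""
    return x + 1

INDEX_F = 0
SYMBOLS = {INDEX_F: real_f}   # what the dynamic linker would look up
resolver_calls = []           # records each trip through the lazy-binding routine

def plt0(index, *args):
    """Stands in for .PLT0: resolve the symbol, patch the GOT, run the target."""
    resolver_calls.append(index)
    got[index] = SYMBOLS[index]   # from now on the slot holds the real address
    return got[index](*args)

def plt1_tail(*args):
    """Stands in for 'pushq $index_f; jmp .PLT0' inside f's PLT slot."""
    return plt0(INDEX_F, *args)

# The GOT slot initially points back into the PLT slot.
got = {INDEX_F: plt1_tail}

def plt1(*args):
    """Stands in for 'jmp [GOT slot]', the first instruction of the PLT slot."""
    return got[INDEX_F](*args)

def caller(x):
    """Stands in for the 'call .PLT1' at 0x500."""
    return plt1(x)
```

In this model the first `caller(...)` goes through `plt0` once and patches the slot; later calls reach `real_f` via `plt1` without touching the resolver again.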
My question is: isn't the first jmp in the PLT slot redundant? Couldn't this work with an indirect call through the GOT instead? For example:
0x500:   
          call [0x2000]
...
0x1000:
   .PLT1: pushq $index_f
          jmp .PLT0
...
0x2000: 
# initially points at the PLT slot (which now starts with the pushq), so the lazy-binding routine gets called:
   .GOT1: 0x1000
# but after the lazy-binding routine has run:
          0x3000   # the address of the real implementation of f
...
0x3000:
     f:  ....
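The proposed variant can be sketched the same way in a toy Python model. Again, all names (`got`, `plt0`, `plt1`, `SYMBOLS`, `resolver_calls`) are hypothetical and only mirror the schematic. The one difference from the standard scheme is that the caller indexes the GOT itself, so the PLT slot shrinks to the push-and-jump part.

```python
# Toy model of the proposed variant: the call site goes through the GOT directly.
# All names are hypothetical stand-ins for the schematic, not a real loader API.

def real_f(x):
    """Stands in for the real implementation of f (0x3000)."""
    return x * 2

INDEX_F = 0
SYMBOLS = {INDEX_F: real_f}   # what the dynamic linker would look up
resolver_calls = []           # records each trip through the lazy-binding routine

def plt0(index, *args):
    """Stands in for .PLT0: resolve the symbol, patch the GOT, run the target."""
    resolver_calls.append(index)
    got[index] = SYMBOLS[index]
    return got[index](*args)

def plt1(*args):
    """The whole PLT slot is now just 'pushq $index_f; jmp .PLT0'."""
    return plt0(INDEX_F, *args)

# The GOT slot initially points at the start of the PLT slot.
got = {INDEX_F: plt1}

def caller(x):
    """Stands in for 'call [GOT slot]' at 0x500: no jmp through the PLT first."""
    return got[INDEX_F](x)
```

Within this model the variant behaves like the standard scheme: one resolver call, then direct dispatch. The model deliberately ignores instruction encodings, call-site size, and anything else a real linker has to account for, so it illustrates the question rather than answering it.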
This might have marginal performance benefits, but the reason I'm asking is a recent scramble in the linker/ELF community to find extra bytes in the 16-byte PLT slot to accommodate Intel IBT (the search failed and resulted in an extra .plt.sec indirection; see 1, 2).