Shorter x86 call instruction

Question

For context, I am code-golfing in x86 machine code.

00000005 <start>:
   5:   e8 25 00 00 00          call   2f <cube>
   a:   50                      push   %eax

Multiple calls later...

0000002f <cube>:
  2f:   89 c8                   mov    %ecx,%eax
  31:   f7 e9                   imul   %ecx
  33:   f7 e9                   imul   %ecx
  35:   c3                      ret

The `call` took 5 bytes even though the offset fits in a single byte! Is there any way to write `call cube`, assemble it with the GNU assembler, and get a shorter encoding? I understand 16-bit offsets could be used, but ideally I'd have a 2-byte instruction like `call reg`.

There is no 2-byte `call` equivalent to `jmp` in the current x86 instruction set. Alternatives (messing with the stack) would be as long or longer. — Jongware, Apr 06 '18 at 20:53
@usr2564301 could I move the address of my label into a register and use `FF` call? — qwr, Apr 06 '18 at 20:56
For a one-time use that would not be shorter either. You cannot encode a *relative* address in `call reg`, so loading the register itself would start off with the exact same length – and then you need to call it. If this jump occurs more, it may start paying off at the 5th or 6th call or so. — Jongware, Apr 06 '18 at 21:01
Yes, if you can generate a full address in a register in less than 3 bytes... Even a RIP-relative LEA doesn't help, because it only exists in `rel32` form, not `rel8`. Most OSes don't let you map anything in the lowest pages (so NULL-pointer deref faults), so usable addresses are larger than 16 bits, outside of 16-bit mode. — Peter Cordes, Apr 06 '18 at 21:01
@PeterCordes I am using this call multiple times, so maybe even if putting the address in the register takes several bytes, we save overall. Can you post as answer? — qwr, Apr 06 '18 at 21:06

Peter Cordes · Accepted Answer · 2023-01-26T17:11:09.713

There is no `call rel8`, nor any way to push a return address and `jmp` in fewer than 5 bytes.
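
To see where the 5 bytes in the question's disassembly come from, here is a minimal Python sketch of how an `E8 rel32` call is encoded. The helper name `encode_call_rel32` is made up for illustration; the addresses are the ones from the objdump output in the question.

```python
import struct

def encode_call_rel32(call_addr: int, target: int) -> bytes:
    """Encode `call target` as E8 rel32 (hypothetical helper, for illustration)."""
    CALL_LEN = 5  # 1 opcode byte + 4 displacement bytes
    # The displacement is relative to the address of the *next* instruction.
    rel = target - (call_addr + CALL_LEN)
    return b"\xe8" + struct.pack("<i", rel)

# Addresses from the question: the call is at 0x5, cube is at 0x2f.
encoded = encode_call_rel32(0x5, 0x2F)
print(encoded.hex(" "))  # e8 25 00 00 00
```

The displacement 0x25 would fit in one byte, but this opcode always carries four displacement bytes, so the upper three are just zero padding.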

To come out ahead with `call reg`, you need to generate a full address in a register in fewer than 3 bytes, because the `call reg` itself takes 2. Even a RIP-relative LEA (64-bit mode only) doesn't help, because it only exists in rel32 form, not rel8.
For a single call, it's clearly not worth it.

If you can reuse the same function-pointer register for multiple 2-byte `call reg` instructions, then you come out ahead even with just 2 calls. (A 5-byte `mov reg, imm32` plus 2x 2-byte `call reg` is a total of 9 bytes, vs. 10 for 2x 5-byte `call rel32`.) But it does cost you a register.
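
The break-even point is simple arithmetic. As a rough sketch, assuming the sizes quoted above (5 bytes for `call rel32`, 5 bytes for `mov reg, imm32`, 2 bytes for `call reg`) and with function names invented for this example:

```python
def direct_calls_size(n_calls: int) -> int:
    """Total bytes for n direct `call rel32` instructions (5 bytes each)."""
    return 5 * n_calls

def register_calls_size(n_calls: int) -> int:
    """Total bytes for one `mov reg, imm32` (5) plus n `call reg` (2 each)."""
    return 5 + 2 * n_calls

for n in range(1, 5):
    print(n, direct_calls_size(n), register_calls_size(n))
# 1 5 7
# 2 10 9
# 3 15 11
# 4 20 13
```

Each call after the first saves 3 more bytes, as long as the register can stay reserved for the function pointer.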

Most OSes don't let you map anything in the lowest pages (so that NULL-pointer dereferences fault), so usable addresses are larger than 16 bits in 32-bit or 64-bit mode. `66 E8 rel16` (4-byte `callw`) isn't an option even in 32-bit mode; that would truncate EIP to IP. https://www.felixcloutier.com/x86/call
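
The truncation can be modeled in a few lines of Python. This is only an illustrative model of the target calculation, not an emulator; `callw_target` is a made-up name and the load address is a hypothetical one picked for the example.

```python
def callw_target(instr_addr: int, rel16: int) -> int:
    """Where `66 E8 rel16` lands in 32-bit mode (illustrative model only)."""
    INSTR_LEN = 4  # 66 prefix + E8 opcode + 2 displacement bytes
    next_eip = instr_addr + INSTR_LEN
    # With a 16-bit operand size the result is masked to 16 bits (EIP -> IP).
    return (next_eip + rel16) & 0xFFFF

# Hypothetical load address, chosen only for illustration.
call_site = 0x08048005
intended = call_site + 4 + 0x25           # where we wanted to go
actual = callw_target(call_site, 0x25)    # where execution would go
print(hex(intended), hex(actual))  # 0x804802e 0x802e
```

Unless the code itself lives in the low 64 KiB, the call lands in unmapped memory instead of the intended function.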

In 32-bit / 64-bit code, I'd consider the linker options necessary to get your code mapped in the zero page as part of the byte count of your code-golf answer (and also the /proc/sys/vm/mmap_min_addr kernel setting, or its equivalent on other OSes). Normally we justify not counting the ELF metadata at all in code-golf, only the bytes of the .text section, so special linker tricks open up a can of worms there.

Generally avoid `call` in code-golf if you can. It's usually better to structure your loops so that you don't need code reuse. For example, `jmp` into the middle of a loop to get part of the loop to run the right number of times, instead of calling a block multiple times.

I guess I usually look at code-golf questions which lend themselves naturally to machine code, and can avoid needing the same block of code from multiple places. I can already spend hours tweaking a short function, so starting an answer to a question that will take more code (and thus have even more room for optimization between / across parts of it) is rare for me.

I think `mov`/`lea` label into register is what I'll go for (I wouldn't use `call` for a single call). I also really like the loop idea. — qwr, Apr 06 '18 at 21:32
@qwr: fun trick: you can skip the first 4 instruction bytes of a loop with 1 byte instead of a 2-byte `jmp`. Use `db 0x3D`, the opcode for `cmp eax, imm32` at the top of the loop. On loop entry, it will consume 4 bytes as an immediate. But when you branch back to the top of the loop, that's inside the bytes that were the immediate, so they run as instructions. — Peter Cordes, Apr 06 '18 at 21:35
I don't think I'm ready for executing immediates as code yet...! Though it probably is a technique important security-wise. — qwr, Apr 06 '18 at 22:02
@PeterCordes An old comment, but what you describe here is what I call [skipping instructions](https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code/235553#235553). Of course **you** are aware of that answer already but I thought it relevant here. — ecm, Jan 26 '23 at 18:25
