
Is there any performance cost to accessing data via a calculated address like `vmovupd ymm13, YMMWORD PTR [rbp+r14*8+78D0h]` versus using an address already stored in a register, like

vmovapd ymm13, YMMWORD PTR [rdi]

or `vmovupd ymm0, ymmword ptr [r9]` vs. `vmovupd ymm0, ymmword ptr [r9+60h]`?

More precisely: does the address arithmetic in `[rbp+r14*8+78D0h]` or `[r9+60h]` cost anything, and if so, what is the background?

Imagine a loop whose counter serves as the base offset per iteration for accessing various blocks of memory, like this example in C:

for (uint64_t i = 0; i < n; i++)
{
    doSomethingWith (&data0[i],&otherData[i]);
    doSomethingDifferentWith (&data1[i+4],&otherData1[i+8]);
    doSomethingElseWith (&data2[i+8],&otherData2[i+4]);
}

This example produces that kind of offset addressing. I wonder if it might be beneficial to iterate using stored addresses instead, which comes with the cost of extra instructions (`lea`, `add`, etc.) produced by `pData0++; pOtherdata += 4; pData2 += 8; ...`.
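For illustration, one pointer-iterating variant of the loop above might look roughly like the following sketch (the `double` element type and the function declarations are placeholders, just like in the example above; only the loop structure matters):

#include <stdint.h>

/* Placeholder declarations so the sketch is self-contained. */
void doSomethingWith          (double *a, double *b);
void doSomethingDifferentWith (double *a, double *b);
void doSomethingElseWith      (double *a, double *b);

/* Pointer-increment variant: every pointer advances by one element per
   iteration, so the accesses inside the calls can use a plain [reg]
   address, at the cost of an extra pointer-update instruction each. */
void loopWithPointers(double *data0, double *otherData,
                      double *data1, double *otherData1,
                      double *data2, double *otherData2,
                      uint64_t n)
{
    double *pData0      = data0;
    double *pOtherData  = otherData;
    double *pData1      = data1 + 4;        /* constant start offsets, as in */
    double *pOtherData1 = otherData1 + 8;   /* data1[i+4], otherData1[i+8]   */
    double *pData2      = data2 + 8;
    double *pOtherData2 = otherData2 + 4;

    for (uint64_t i = 0; i < n; i++)
    {
        doSomethingWith          (pData0++,  pOtherData++);
        doSomethingDifferentWith (pData1++,  pOtherData1++);
        doSomethingElseWith      (pData2++,  pOtherData2++);
    }
}

Whether the saved address arithmetic outweighs the extra pointer updates is exactly what this question is about.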

This is not about how to visualize effects using profilers. My aim is to understand the theory and mechanisms under the hood.

Peter Cordes
Chris G.
  • In such cases, the general answer is "profile it and find out". – Sebastian Redl Feb 08 '22 at 08:08
  • @SebastianRedl Profiling is something I indeed do to measure the effects. My question is about learning and understanding the theory of the mechanisms that lead to those measurable effects. – Chris G. Feb 08 '22 at 08:20
  • An offset of `78D0h` doesn't fit into a signed byte offset, so you need a dword offset in the machine encoding. This will, if nothing else, lengthen the instruction's encoding, which can have effects on the instruction cache. – ecm Feb 08 '22 at 08:37
  • Good catch @ecm, this was a bad example. I updated it with a better one. – Chris G. Feb 08 '22 at 08:42
  • You may want to have a peek at the `LEA` instruction - Load Effective Address. x86 has a pretty powerful address generator, which is so useful that many compilers will use it as an additional arithmetic unit (see the sketch after these comments). But of course, it can't do that at the same time as generating real addresses. – MSalters Feb 08 '22 at 09:21
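As a small illustration of the `LEA` point above (a hypothetical function, not taken from the question): pure pointer arithmetic with a base, a scaled index, and a constant offset typically compiles to a single `lea` on x86-64, using the same address-generation machinery that a memory operand would use.

#include <stddef.h>

/* No memory access here, only address arithmetic. On x86-64, gcc and clang
   typically compile this to a single instruction such as
   lea rax, [rdi+rsi*8+16], i.e. the addressing hardware doing the
   multiply-and-add for ordinary integer math. */
double *element_after_next(double *p, size_t i)
{
    return p + i + 2;
}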

1 Answer


The specifics depend on the microarchitecture of the processor you are programming for. Generally speaking, there is a penalty for using a SIB operand if all fields in the operand are filled in, i.e. if there is a base, an index, and a displacement. The penalty is 1 µop extra latency for computing the address.
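As a hedged illustration (not from the original answer): the two loops below are intended to produce the two operand shapes being compared here, an indexed `base + index*scale + disp` load versus a plain `[base]` load with an explicit pointer increment.

#include <stdint.h>

/* Index-based loop: compilers typically emit a load such as
   addsd xmm0, QWORD PTR [rdi+rax*8+32]  (base + index*8 + displacement),
   i.e. the fully populated operand discussed above. */
double sum_indexed(const double *a, uint64_t n)
{
    double s = 0.0;
    for (uint64_t i = 0; i < n; i++)
        s += a[i + 4];
    return s;
}

/* Pointer-bump loop: the intended shape is a load from just [rax] plus an
   explicit add rax, 8 each iteration. Compilers may canonicalize either
   loop into either form, so check the generated asm for your target. */
double sum_pointer(const double *a, uint64_t n)
{
    double s = 0.0;
    const double *p = a + 4;
    const double *end = p + n;
    for (; p != end; ++p)
        s += *p;
    return s;
}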

Refer to Agner Fog's microarchitecture guide for a more detailed explanation.

fuz
  • Are you talking about un-lamination on Intel CPUs to produce an extra uop? That happens if there are two registers, regardless of whether there is a disp or not. [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634) But IDK if that's what you mean, because latency is measured in cycles, not uops. "base + disp" can have 1 cycle lower load-use latency on Intel CPUs, but with a penalty if the `+base` crosses a 4k boundary: [Is there a penalty when base+offset is in a different page than the base?](https://stackoverflow.com/q/52351397) – Peter Cordes Feb 08 '22 at 09:49
  • @PeterCordes Thanks, I might have remembered this wrongly. Please post a more comprehensive answer if possible. – fuz Feb 08 '22 at 09:50