Is there any performance cost to accessing data through a calculated address like
vmovupd ymm13, YMMWORD PTR [rbp+r14*8+78D0h]
versus using an address stored in a register like
vmovapd ymm13, YMMWORD PTR [rdi]
or
vmovupd ymm0,ymmword ptr [r9]
vs
vmovupd ymm0,ymmword ptr [r9+60h]
More precisely: Does the arithmetic in [rbp+r14*8+78D0h] or [r9+60h] cost something and if so, what is the background?
Imagine a loop whose counter serves as the base offset in each iteration for accessing various blocks of memory, as in this C example:
for (uint64_t i = 0; i < n; i++)
{
    doSomethingWith(&data0[i], &otherData[i]);
    doSomethingDifferentWith(&data1[i+4], &otherData1[i+8]);
    doSomethingElseWith(&data2[i+8], &otherData2[i+4]);
}
This example produces the kind of offset addressing shown above.
I wonder whether it might be beneficial to iterate using stored addresses instead, which comes at the cost of extra instructions (lea, add, etc.) produced by pData0++; pOtherdata += 4; pData2 += 8; ...
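To make the two variants concrete, here is a minimal sketch of what I mean. The function names sum_indexed and sum_pointers, the use of double, and the reduction to a simple sum are assumptions made only for illustration; they stand in for the doSomething... calls above. Each array is assumed to hold at least n + 8 elements. An optimizing compiler may well turn either form into the other, so the source form does not necessarily dictate the addressing modes that end up in the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Variant A (hypothetical): a single counter i. Each access is written as
 * base + i*sizeof(double) + constant, which maps naturally onto an
 * addressing mode such as [base + index*8 + disp]. */
double sum_indexed(const double *data0, const double *data1,
                   const double *data2, uint64_t n)
{
    assert(data0 != NULL && data1 != NULL && data2 != NULL);
    double sum = 0.0;
    for (uint64_t i = 0; i < n; i++) {
        sum += data0[i] + data1[i + 4] + data2[i + 8];
    }
    return sum;
}

/* Variant B (hypothetical): one stored pointer per stream. The constant
 * offsets are folded into the start pointers, and each pointer needs its
 * own increment per iteration. */
double sum_pointers(const double *data0, const double *data1,
                    const double *data2, uint64_t n)
{
    assert(data0 != NULL && data1 != NULL && data2 != NULL);
    const double *p0 = data0;
    const double *p1 = data1 + 4;
    const double *p2 = data2 + 8;
    double sum = 0.0;
    for (uint64_t i = 0; i < n; i++) {
        sum += *p0 + *p1 + *p2;
        p0++;
        p1++;
        p2++;
    }
    return sum;
}
```

Both functions read the same elements in the same order, so they return the same value; the only difference is how the addresses are expressed in the source.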
This is not about how to visualize effects using profilers. My aim is to understand the theory and mechanisms under the hood.