I have a program that is almost spending all his time computing loops like
for(int j = 0; j < BIGNUMBER; j++)
    for(int i = 0; i < SMALLNUMBER; i++)
        result += var[i] / sqrt((A[i].x-B[j].x)*(A[i].x-B[j].x)+(A[i].y-B[j].y)*(A[i].y-B[j].y)+(A[i].z-B[j].z)*(A[i].z-B[j].z));
Where 1.0/sqrt(...) computes the inverse of the norm of the difference between the two vectors A[i] = {A[i].x, A[i].y, A[i].z} and B[j] = {B[j].x, B[j].y, B[j].z} which is also the most costly part of the loop.
Is there any way to optimize the loop, even with some precision loss?
Update:
Here the assembly code of an unvectorized loop step with the worse latency of every instruction. You clearly see that the inverse square root is the bottleneck:
movsd   A(%rip), %xmm0      1
movsd   A+8(%rip), %xmm2    1
subsd   B(%rip), %xmm0      3
subsd   B+8(%rip), %xmm2    3
movsd   A+16(%rip), %xmm1   1
mulsd   %xmm0, %xmm0        5
subsd   B+16(%rip), %xmm1   3
mulsd   %xmm2, %xmm2        5
mulsd   %xmm1, %xmm1        5
addsd   %xmm2, %xmm0        3
addsd   %xmm1, %xmm0        3
movsd   .LC0(%rip), %xmm1   1
unpcklpd    %xmm0, %xmm0    1
cvtpd2ps    %xmm0, %xmm0    4
unpcklps    %xmm0, %xmm0    3
cvtps2pd    %xmm0, %xmm0    2
sqrtsd  %xmm0, %xmm0        58
divsd   %xmm0, %xmm1        32
mulsd   var(%rip), %xmm1    5
addsd   result(%rip), %xmm1 3
cvttsd2si   %xmm1, %eax     3
movsd   %xmm1, result(%rip) 1
(By the way I don't understand why is it doing unpcklpd cvtpd2ps unpcklps cvtps2pd.)
 
     
     
     
     
     
    