I have a number of tight loops I'm trying to optimize with GCC and intrinsics. Consider for example the following function.
#include <x86intrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=0; i<n; i+=8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}
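For context, a minimal driver along the lines below is enough to exercise it (the buffer names and values are just for illustration, and a real benchmark would time repeated calls and consume the result). Since the loads/stores are the aligned variants, the buffers must be 32-byte aligned and n a multiple of 8.

int main(void) {
    const int n = 2048;   /* small enough to stay in L1 for these tests */
    /* 32-byte aligned buffers, required by _mm256_load_ps/_mm256_store_ps */
    float *x = _mm_malloc(n * sizeof(float), 32);
    float *y = _mm_malloc(n * sizeof(float), 32);
    float *z = _mm_malloc(n * sizeof(float), 32);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    triad(x, y, z, n);    /* the function above */
    _mm_free(x); _mm_free(y); _mm_free(z);
    return 0;
}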
The main loop GCC produces for triad looks like this
20: vmulps ymm0,ymm1,[rsi+rax*1]
25: vaddps ymm0,ymm0,[rdi+rax*1]
2a: vmovaps [rdx+rax*1],ymm0
2f: add rax,0x20
33: cmp rax,rcx
36: jne 20
But the cmp instruction is unnecessary. Instead of having rax start at zero and finish at sizeof(float)*n, we can set the base pointers (rsi, rdi, and rdx) to the ends of the arrays, set rax to -sizeof(float)*n, and test for zero (the add already sets the flags, so jne needs no separate cmp). I am able to do this with my own assembly code like this
.L2:
    vmulps  ymm1, ymm2, [rdi+rax]
    vaddps  ymm0, ymm1, [rsi+rax]
    vmovaps [rdx+rax], ymm0
    add     rax, 32
    jne     .L2
but I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently GCC and intrinsics have served me well, so I'm wondering if there is a compiler switch or a way to reorder/change my code so that GCC does not produce the cmp instruction.
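For completeness, here is a sketch of how the same loop could be embedded with GCC extended inline asm. This is not what I'm after (I'd like the compiler to generate it on its own) and is shown only to make the target loop concrete; it assumes n > 0, n a multiple of 8, and 32-byte aligned pointers.

void triad_asm(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x + n, *y2 = y + n, *z2 = z + n;
    long long i = -(long long)n * (long long)sizeof(float);  /* byte offset, runs up to 0 */
    if (i == 0) return;
    __asm__ __volatile__ (
        "vbroadcastss %4, %%ymm2\n\t"
        "1:\n\t"
        "vmulps  (%1,%0), %%ymm2, %%ymm1\n\t"   /* k4 * y */
        "vaddps  (%2,%0), %%ymm1, %%ymm0\n\t"   /* + x    */
        "vmovaps %%ymm0, (%3,%0)\n\t"           /* -> z   */
        "add     $32, %0\n\t"
        "jne     1b\n\t"
        : "+r" (i)
        : "r" (y2), "r" (x2), "r" (z2), "m" (k)
        : "xmm0", "xmm1", "xmm2", "memory", "cc"
    );
}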
I tried the following, but it still produces a cmp; every variation I have tried still produces the cmp (one more variation in the same spirit is sketched after the code below).
void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x+n;
    float *y2 = y+n;
    float *z2 = z+n;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=-n; i<0; i+=8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}
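For example, a variation in the same spirit is a do/while whose exit condition is "!= 0" rather than "< 0", so that in principle the flag set by the final add could feed jne directly. The sketch below is the kind of rewrite I mean (the name and the 64-bit counter are just for illustration, and I'm not claiming it fares any better):

void triad3(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    float *x2 = x + n, *y2 = y + n, *z2 = z + n;
    long long i = -(long long)n;          /* assumes n > 0 and a multiple of 8 */
    do {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
        i += 8;
    } while (i);                          /* i != 0: the hoped-for add/jne shape */
}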
Edit:
I'm interested in maximizing instruction level parallelism (ILP) for these functions for arrays that fit in the L1 cache (actually for n=2048). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).
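To make concrete what I mean by unrolling, a 2x unrolled version of the intrinsic loop would look like the sketch below (unroll16 in the table further down is the same idea taken to 16 stores per iteration; the sketch assumes n is a multiple of 16):

void triad_unroll2(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    for(int i = 0; i < n; i += 16) {
        _mm256_store_ps(&z[i],   _mm256_add_ps(_mm256_load_ps(&x[i]),   _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        _mm256_store_ps(&z[i+8], _mm256_add_ps(_mm256_load_ps(&x[i+8]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+8]))));
    }
}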
Edit:
Here is a table of results for a Core2 (pre-Nehalem), an Ivy Bridge, and a Haswell system. intrinsic is the result of using the intrinsics, unroll1 is my assembly code without the cmp, and unroll16 is my assembly code unrolled 16 times. The percentages are the percentage of the peak performance (frequency*num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX and 96 for FMA).
                 SSE (Core2)   AVX (Ivy Bridge)   FMA (Haswell)
intrinsic           71.3%           90.9%             53.6%
unroll1             97.0%           96.1%             63.5%
unroll16            98.6%           90.4%             93.6%
ScottD              96.5%
32B code align                      95.5%
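To make the peak numbers concrete: at, say, a 3.0 GHz clock (purely illustrative, not necessarily what these machines run at), the AVX peak would be 3.0e9 cycles/s * 48 bytes/cycle = 144 GB/s. Each AVX iteration of the loop moves 2*32 bytes of loads plus 32 bytes of stores = 96 bytes, so full bandwidth corresponds to one iteration every 2 cycles (likewise 48 bytes per SSE iteration at 24 bytes/cycle).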
For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use the cmp. With AVX I get the best result without unrolling and without using the cmp. It's interesting that on Ivy Bridge unrolling is actually worse. On Haswell I get by far the best result by unrolling, which is why I asked this question. The source code to test this can be found in that question.
Edit:
Based on ScottD's answer, I now get almost 97% with intrinsics for my Core2 system (pre-Nehalem, 64-bit mode). I'm not sure why the cmp matters, actually, since the loop should take 2 clock cycles per iteration anyway (48 bytes of load/store traffic per SSE iteration at 24 bytes/cycle). For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra cmp. On Haswell only unrolling works anyway.