Vectorization: when is it worth manually unrolling loops?

Question

I would like a general understanding of when I can expect a compiler to vectorize a loop, and when it is worth unrolling the loop myself to help the compiler decide to vectorize it.

I understand that the details matter a lot (which compiler, which compilation options, which architecture, how the code in the loop is written, etc.), but I wonder if there are some general guidelines for modern compilers.

To be more specific, here is an example with a simple loop (the code is not supposed to compute anything useful):

    double *A,*B; // two arrays
    int delay = something;
    [...]


    double numer = 0, denomB = 0, denomA = 0;
    for (int idxA = 0; idxA < Asize; idxA++)
    {
        int idxB = idxA + (Bsize-Asize)/2 + delay;
        numer  += A[idxA] * B[idxB];
        denomA += A[idxA] * A[idxA];
        denomB += B[idxB] * B[idxB];
    }

Can I expect a compiler to vectorize the loop or would it be useful to rewrite the code like the following?

    // note: as written, this assumes Asize is a multiple of 4
    for ( int idxA = 0; idxA < Asize; idxA+=4 )
    {
        int idxB = idxA + (Bsize-Asize)/2 + delay;
        numer  += A[idxA] * B[idxB];
        denomA += A[idxA] * A[idxA];
        denomB += B[idxB] * B[idxB];

        numer  += A[idxA+1] * B[idxB+1];
        denomA += A[idxA+1] * A[idxA+1];
        denomB += B[idxB+1] * B[idxB+1];

        numer  += A[idxA+2] * B[idxB+2];
        denomA += A[idxA+2] * A[idxA+2];
        denomB += B[idxB+2] * B[idxB+2];

        numer  += A[idxA+3] * B[idxB+3];
        denomA += A[idxA+3] * A[idxA+3];
        denomB += B[idxB+3] * B[idxB+3];
    }

**measure**. identify possible bottlenecks. **measure**. change code/compilation options. **measure**. unless you **measure** no changes need to be made (and, most often, after **measure** you realize no changes need to be made). — pmg, May 06 '20 at 09:55
@pmg: thanks. So can I assume the answer is "no general guidelines" it really depends on the specific compiler+architecture+code+compilation options+etc ? — luca, May 06 '20 at 09:57
@luca yes, any a priori predictions are bound to fail, since the compiler has far too complex machinery making the decisions. Just check it/measure it. It is pretty fail-proof — bartop, May 06 '20 at 09:57
A modern compiler will use any available "trick" for a given architecture to optimize the code. I would say you should never unroll (unless you have the means to measure the impact), because you might force the compiler to pick some less efficient means of optimization (e.g. SIMD or not, ...) — Jean-Marc Volle, May 06 '20 at 09:58
other than "**measure**"? Correct, no guidelines in my understanding. — pmg, May 06 '20 at 09:59
Without non-associative math optimizations, neither version of your code will significantly profit from vectorization — chtz, May 06 '20 at 12:43
@chtz interesting comment, it looks like there is something I need to know more about vectorization. Could you elaborate a little bit more on that? — luca, Feb 15 '21 at 10:49

score 2 · Accepted Answer · answered May 06 '20 at 11:32

Short answer, as others said: there are no general guidelines if you do not specify the compiler or the target architecture.

As a remark, it is generally better to let the compiler optimize the code these days, because it "knows" the capabilities of the target architecture better. There are cases where unrolling the loops will not be faster.

If someone sees this and needs it, GCC has the -funroll-loops flag.
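A minimal sketch of how one might check what the compiler does rather than guess, assuming GCC or Clang. The function name `plain_sums`, the `Sums` struct and the precomputed `offset` parameter (standing in for `(Bsize-Asize)/2 + delay` from the question, assumed non-negative here) are made up for illustration. The flags in the comment ask the compiler to report which loops it vectorized; the reports differ between compiler versions, so treat them as something to read, not as a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the three sums from the question. */
typedef struct {
    double numer;
    double denomA;
    double denomB;
} Sums;

/*
 * The loop from the question, wrapped in a function so it can be compiled
 * and inspected on its own. Commands to try (file name is an assumption):
 *
 *   gcc -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed -c sums.c
 *   gcc -O3 -march=native -ffast-math -fopt-info-vec-optimized -c sums.c
 *   clang -O3 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c sums.c
 *
 * Comparing the reports with and without -ffast-math shows whether strict
 * floating-point semantics are what holds the vectorizer back.
 */
Sums plain_sums(const double *restrict A, size_t Asize,
                const double *restrict B, size_t Bsize, size_t offset)
{
    /* The window of B that is read must lie inside B. */
    assert(offset + Asize <= Bsize);

    Sums s = {0.0, 0.0, 0.0};
    for (size_t i = 0; i < Asize; i++) {
        const double a = A[i];
        const double b = B[i + offset];
        s.numer  += a * b;
        s.denomA += a * a;
        s.denomB += b * b;
    }
    return s;
}
```

Once the report says what happened, time both builds on realistic data before changing the source, as the comments above suggest.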

score 0 · Answer 2 · answered Feb 15 '21 at 12:04

I gather from the other answers and comments that it is not advisable to manually unroll loops: the compiler knows better.

However, the compiler might fail to vectorize your code depending on the optimization options used in compilation. Why? Because floating-point addition and multiplication are not associative (they are commutative, but rounding makes the result depend on how the operations are grouped). This prevents the compiler from regrouping the operations, which in turn prevents vectorization in certain scenarios, such as sum reductions, where you would expect your code to be vectorized.
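To illustrate the point about grouping, here is a small sketch in C. The names `grouping_changes_sum` and `dot4` are hypothetical. The first function checks whether two groupings of the same three-term sum give different results. The second shows a manual unroll that, unlike the one in the question, gives each unrolled step its own accumulator, so the regrouping is written out in the source instead of being left to the compiler. This changes the rounding slightly compared with a single running sum, and whether the compiler turns it into SIMD code still has to be checked and measured.

```c
#include <stddef.h>

/* Returns 1 if (a + b) + c and a + (b + c) round to different doubles.
 * volatile keeps each grouping as a separately stored result. */
int grouping_changes_sum(double a, double b, double c)
{
    volatile double left  = (a + b) + c;
    volatile double right = a + (b + c);
    return left != right;
}

/* Dot product with four independent partial sums.
 * Each accumulator forms its own dependency chain, so the four
 * multiply-adds in one iteration do not have to wait for each other. */
double dot4(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    /* Combine the partial sums, then handle the 0 to 3 leftover elements. */
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

For example, `grouping_changes_sum(1e30, -1e30, 1.0)` reports a difference, because `-1e30 + 1.0` rounds back to `-1e30`, while `(1e30 + -1e30) + 1.0` is exactly `1.0`.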
