I recommend doing the vectorization manually. One reason is that auto-vectorizers don't seem to handle loop-carried dependencies well (which also limits loop unrolling).
To avoid code bloat and arcane intrinsics I use Agner Fog's vectorclass. In my experience it's just as fast as writing intrinsics by hand, and it automatically takes advantage of SSE2 through AVX2 depending on how you compile (I tested AVX2 on an Intel emulator). I have written GEMM code using the vectorclass that works on SSE2 up to AVX2, and when I run it on a system with AVX my code is already faster than Eigen, which only uses SSE. Here is your function with the vectorclass (I did not try unrolling the loop).
#include "omp.h"
#include "math.h"
#include "vectorclass.h"
#include "vectormath.h"

// A, B, C, D, in are assumed not to alias (__restrict); H*W must be a multiple of 8
void loop(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float *in) {
    #pragma omp parallel for
    for (int j = 0; j < H*W; j += 8) {
        Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
        Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
        Vec8f invec = atan2(Gs, Gc);   // vectorized atan2 from vectormath.h
        invec.store(&in[j]);
    }
}
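A hypothetical call site, just to illustrate the contract (the sizes here are my own choice, not from your code):

// hypothetical driver: H and W chosen so H*W is a multiple of 8
const int H = 64, W = 64;
const int n = H * W;
float *A = new float[n], *B = new float[n], *C = new float[n],
      *D = new float[n], *in = new float[n];
// ... fill A, B, C, D ...
loop(H, W, W, A, B, C, D, in);  // outer_stride is unused in this version

To actually get AVX/AVX2 code out of the vectorclass, compile with the corresponding flag, e.g. /arch:AVX in Visual Studio or -mavx with GCC.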
When doing the vectorization yourself you have to be careful with array bounds. In the function above H*W needs to be a multiple of 8. There are several solutions for that, but the easiest and most efficient is to make the arrays (A, B, C, D, in) a bit larger (at most 7 floats larger) if necessary, so their length rounds up to a multiple of 8.
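As a sketch, the padding approach looks like this (the rounding expression is mine, not something the vectorclass requires):

// round H*W up to the next multiple of 8 so the Vec8f loads/stores never run past the end
int n_padded = (H*W + 7) & ~7;
float *A = new float[n_padded];  // allocate B, C, D and in the same way

However, another solution is the following code, which does not require H*W to be a multiple of 8 but is not as pretty.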
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

// A, B, C, D, in are assumed not to alias (__restrict)
void loop_fix(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float *in) {
    #pragma omp parallel for
    for (int j = 0; j < ROUND_DOWN(H*W, 8); j += 8) {
        Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
        Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
        Vec8f invec = atan2(Gs, Gc);
        invec.store(&in[j]);
    }
    // scalar cleanup for the remaining 0..7 elements
    for (int j = ROUND_DOWN(H*W, 8); j < H*W; j++) {
        float Gs = D[j] - B[j];
        float Gc = A[j] - C[j];
        in[j] = atan2f(Gs, Gc);
    }
}
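If you would rather keep the tail in SIMD as well, the vectorclass has partial loads and stores. A sketch (untested; load_partial zero-fills the unused lanes and store_partial writes only the valid ones):

int j = ROUND_DOWN(H*W, 8);
int n = H*W - j;                      // 0..7 leftover elements
if (n > 0) {
    Vec8f Gs = Vec8f().load_partial(n, &D[j]) - Vec8f().load_partial(n, &B[j]);
    Vec8f Gc = Vec8f().load_partial(n, &A[j]) - Vec8f().load_partial(n, &C[j]);
    atan2(Gs, Gc).store_partial(n, &in[j]);
}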
One challenge with doing the vectorization yourself is finding a SIMD math library (e.g. for atan2f). The vectorclass supports three options: plain non-SIMD, AMD's LibM, and Intel's SVML (I used the non-SIMD option in the code above).
SIMD math libraries for SSE and AVX
Some last comments you might want to consider: Visual Studio has auto-parallelization (off by default) as well as auto-vectorization (on by default, at least in release mode). You can try these instead of OpenMP to reduce code bloat.
http://msdn.microsoft.com/en-us/library/hh872235.aspx
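For instance, once auto-parallelization is enabled with /Qpar, you can hint the compiler on the plain scalar loop (the pragma is described in the article above; this sketch assumes the same arrays as before):

#pragma loop(hint_parallel(8))
for (int j = 0; j < H*W; j++) {
    in[j] = atan2f(D[j] - B[j], A[j] - C[j]);
}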
Additionally, Microsoft has the Parallel Patterns Library (PPL). It's worth looking into since Microsoft's OpenMP support is limited, and it's nearly as easy to use as OpenMP. It's possible that one of these options works better with auto-vectorization (though I doubt it). A minimal PPL sketch of the same chunked loop (my own adaptation, again assuming H*W is a multiple of 8):
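#include <ppl.h>
using namespace concurrency;

parallel_for(0, H*W/8, [&](int i) {
    int j = 8*i;                      // each task processes one Vec8f chunk
    Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
    Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
    atan2(Gs, Gc).store(&in[j]);
});

Like I said, though, I would do the vectorization manually with the vectorclass.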