I recommend doing the vectorization manually. One reason is that auto-vectorizers don't seem to handle loop-carried dependencies well (which also limits loop unrolling).
To avoid code bloat and arcane intrinsics I use Agner Fog's vectorclass. In my experience it's just as fast as writing intrinsics by hand, and it automatically takes advantage of SSE2 through AVX2 depending on how you compile (I tested AVX2 on an Intel emulator). I have written GEMM code using the vectorclass that works on SSE2 up to AVX2, and when I run it on a system with AVX my code is already faster than Eigen, which only uses SSE. Here is your function with the vectorclass (I did not try unrolling the loop).
#include "omp.h"
#include "math.h"
#include "vectorclass.h"
#include "vectormath.h"

// A, B, C, D, in are assumed not to alias (__restrict); H*W must be a multiple of 8
void loop(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float *in) {
    #pragma omp parallel for
    for (int j = 0; j < H*W; j += 8) {
        Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
        Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
        Vec8f invec = atan2(Gs, Gc);   // vectorized atan2 from vectormath.h
        invec.store(&in[j]);
    }
}
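A hypothetical call site, just to illustrate the contract (the sizes here are my own choice, not from your code):

// hypothetical driver: H and W chosen so H*W is a multiple of 8
const int H = 64, W = 64;
const int n = H * W;
float *A = new float[n], *B = new float[n], *C = new float[n],
      *D = new float[n], *in = new float[n];
// ... fill A, B, C, D ...
loop(H, W, W, A, B, C, D, in);  // outer_stride is unused in this version

To actually get AVX/AVX2 code out of the vectorclass, compile with the corresponding flag, e.g. /arch:AVX in Visual Studio or -mavx with GCC.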
When doing the vectorization yourself you have to be careful with array bounds. In the function above H*W needs to be a multiple of 8. There are several solutions for that, but the easiest and most efficient is to make the arrays (A, B, C, D, in) a bit larger (at most 7 floats larger) if necessary, so their length rounds up to a multiple of 8.
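As a sketch, the padding approach looks like this (the rounding expression is mine, not something the vectorclass requires):

// round H*W up to the next multiple of 8 so the Vec8f loads/stores never run past the end
int n_padded = (H*W + 7) & ~7;
float *A = new float[n_padded];  // allocate B, C, D and in the same way

However, another solution is the following code, which does not require H*W to be a multiple of 8 but is not as pretty.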
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

// A, B, C, D, in are assumed not to alias (__restrict)
void loop_fix(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float *in) {
    #pragma omp parallel for
    for (int j = 0; j < ROUND_DOWN(H*W, 8); j += 8) {
        Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
        Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
        Vec8f invec = atan2(Gs, Gc);
        invec.store(&in[j]);
    }
    // scalar cleanup for the remaining 0..7 elements
    for (int j = ROUND_DOWN(H*W, 8); j < H*W; j++) {
        float Gs = D[j] - B[j];
        float Gc = A[j] - C[j];
        in[j] = atan2f(Gs, Gc);
    }
}
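If you would rather keep the tail in SIMD as well, the vectorclass has partial loads and stores. A sketch (untested; load_partial zero-fills the unused lanes and store_partial writes only the valid ones):

int j = ROUND_DOWN(H*W, 8);
int n = H*W - j;                      // 0..7 leftover elements
if (n > 0) {
    Vec8f Gs = Vec8f().load_partial(n, &D[j]) - Vec8f().load_partial(n, &B[j]);
    Vec8f Gc = Vec8f().load_partial(n, &A[j]) - Vec8f().load_partial(n, &C[j]);
    atan2(Gs, Gc).store_partial(n, &in[j]);
}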
One challenge with doing the vectorization yourself is finding a SIMD math library (e.g. for atan2f). The vectorclass supports three options: plain non-SIMD, AMD's LibM, and Intel's SVML (I used the non-SIMD option in the code above).
SIMD math libraries for SSE and AVX
Some last comments you might want to consider: Visual Studio has auto-parallelization (off by default) as well as auto-vectorization (on by default, at least in release mode). You can try these instead of OpenMP to reduce code bloat.
http://msdn.microsoft.com/en-us/library/hh872235.aspx
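For instance, once auto-parallelization is enabled with /Qpar, you can hint the compiler on the plain scalar loop (the pragma is described in the article above; this sketch assumes the same arrays as before):

#pragma loop(hint_parallel(8))
for (int j = 0; j < H*W; j++) {
    in[j] = atan2f(D[j] - B[j], A[j] - C[j]);
}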
Additionally, Microsoft has the Parallel Patterns Library (PPL). It's worth looking into since Microsoft's OpenMP support is limited, and it's nearly as easy to use as OpenMP. It's possible that one of these options works better with auto-vectorization (though I doubt it). A minimal PPL sketch of the same chunked loop (my own adaptation, again assuming H*W is a multiple of 8):
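#include <ppl.h>
using namespace concurrency;

parallel_for(0, H*W/8, [&](int i) {
    int j = 8*i;                      // each task processes one Vec8f chunk
    Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
    Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
    atan2(Gs, Gc).store(&in[j]);
});

Like I said, though, I would do the vectorization manually with the vectorclass.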