Using Vector Intrinsics Yields Unexpected (Slow) Results

Question

I'm attempting to use vector intrinsics to speed up a trivial piece of code (as a test), and I'm not getting a speed up - in fact, it runs slower by a bit sometimes. I'm wondering two things:

Do vectorized instructions speed up simple load from one region / store to another type operations in any way?
Division intrinsics aren't yielding anything faster either, and in fact, I started getting segfaults when I introduced the _mm256_div_pd intrinsic. Is my usage correct?

constexpr size_t VECTORSIZE{ (size_t)1024 * 1024 * 64 }; //large array to force main memory accesses

void normal_copy(const fftw_complex* in, fftw_complex* copyto, size_t copynum)
{
    for (size_t i = 0; i < copynum; i++)
    {
        copyto[i][0] = in[i][0] / 128.0;
        copyto[i][1] = in[i][1] / 128.0;
    }
}

#if defined(_WIN32) || defined(_WIN64)
void avx2_copy(const fftw_complex* __restrict in, fftw_complex* __restrict copyto, size_t copynum)
#else
void avx2_copy(const fftw_complex* __restrict__ in, fftw_complex* __restrict__ copyto, size_t copynum)
#endif
{   //avx2 supports 256 bit vectorized instructions
    constexpr double zero = 0.0;
    constexpr double dnum = 128.0;
    __m256d tmp = _mm256_broadcast_sd(&zero);
    __m256d div = _mm256_broadcast_sd(&dnum);
    for (size_t i = 0; i < copynum; i += 2)
    {
        tmp = _mm256_load_pd(&in[i][0]);
        tmp = _mm256_div_pd(tmp, div);
        _mm256_store_pd(&copyto[i][0], tmp);
    }
}

int main()
{
    fftw_complex* invec   = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec1 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec3 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));

    //some initialization stuff for invec

    //some timing stuff (wall clock)
    normal_copy(invec, outvec1, VECTORSIZE);

    //some timing stuff (wall clock)
    avx2_copy(invec, outvec3, VECTORSIZE);

    return 0;
}

fftw_complex is a datatype equivalent to std::complex. I've tested using both g++ (with -O3 and -ftree-vectorize) on Linux, and Visual Studio on Windows - same results - AVX2 copy and div is slower and segfaults for certain array sizes. Tested array sizes are always powers of 2, so anything related to reading invalid memory (from _mm256_load_pd) doesn't seem to be the issue. Any thoughts?

Seg fauls are almost certainly alignment due to use of `_mm256_load_pd` instead of `_mm256_loadu_pd`. Simple loops like `normal_copy` are probably handled well by your compiler. And it may well have noticed the `/128.0` is a nice power of two and turned it into a multiply by the reciprocal. — bg2b, Mar 04 '22 at 00:37
The `fftw_malloc` call is supposed to handle the alignment. I'll try manually specifying an alignment... — TomAdo, Mar 04 '22 at 00:48
But by and large, you think the other loop is fast due to compiler optimizations, and so there's not much vectorized instructions can do for me? — TomAdo, Mar 04 '22 at 00:49
As @bg2b assumed, your loop is simple enough for a compiler to vectorize and to replace the division by a multiplication: https://godbolt.org/z/Po5ndqjez (slightly simplified) — chtz, Mar 04 '22 at 07:58
@chtz So, do vectorized loads followed by stores (no FP ops) help with more complicated codes? — TomAdo, Mar 04 '22 at 15:33
It is hard to say in general what is "too complicated" for a (modern) compiler to optimize as good as you can implement it manually. Just have a look at the code your compiler generates and think about whether you can optimize this better (afterwards benchmark, of course). — chtz, Mar 04 '22 at 16:30

score 1 · Answer 1 · answered Mar 06 '22 at 00:35

Put it shortly: using SIMD instructions does not help much here except for the use of non-temporal stores.

Do vectorized instructions speed up simple load from one region / store to another type operations in any way?

This is dependent of the type of data that is copied and the target processor as well as the target RAM used. That being said, in your case, a modern x86-64 processor should nearly saturate the memory hierarchy with a scalar code because modern processors can both load and store 8-bytes in parallel per cycle and most processor are working at least at 2.5 GHz. This means 37.2 GiB/s for a core at this minimum frequency. While this is generally not enough to saturate the L1 or L2 cache, this is enough to saturate the RAM of most PC.

In practice, this is significantly more complex and the saturation is clearly underestimated. Indeed, Intel x86-64 processors and AMD Zen ones use a write allocate cache policy that cause written cache lines to be read first from the memory before being written back. This means that the actual throughput would be 37.2*1.5 = 56 GiB/s. This is not enough: even if the RAM would be able to support such a high throughput, cores often cannot because of the very high latency of the RAM compared to the size of the cache and the capability of hardware prefetchers (see this related post for more information). To reduce the wasted memory througput and so increase the real throughput, you can use non-temporal streaming instructions (aka. NT stores) like _mm256_stream_pd. Note that such an instruction require the data pointer to be aligned.

Note that NT store are only useful for data that are not directly reused or that are to big to fit in caches. Note also that memcpy should use NT-stores on x86-64 processor on relatively big input data. Note also that working in-place does not cause any issue due to the write allocate policy.

Division intrinsics aren't yielding anything faster either, and in fact, I started getting segfaults when I introduced the _mm256_div_pd intrinsic. Is my usage correct?

Because of the possible address misalignment (mentioned in the comments), you need to use a scalar loop to operate on some items until the address is aligned. As also mentioned in the comment, using a multiplication (_mm256_mul_pd) by 1./128. is much more efficient. The multiplication adds some latency but does not impact the throughput.

PS: do not forget to free the allocated memory.

Using Vector Intrinsics Yields Unexpected (Slow) Results

1 Answers1