I'm attempting to use vector intrinsics to speed up a trivial piece of code (as a test), but I'm not getting a speedup; in fact, it sometimes runs slightly slower. I'm wondering two things:
- Do vectorized instructions speed up simple load-from-one-region, store-to-another operations in any way?
- Division intrinsics aren't yielding anything faster either, and I started getting segfaults when I introduced the _mm256_div_pd intrinsic. Is my usage correct?
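#include <cstddef>     // size_t
#include <immintrin.h> // AVX intrinsics (__m256d, _mm256_*)
#include <fftw3.h>     // fftw_complex, fftw_malloc
constexpr size_t VECTORSIZE{ (size_t)1024 * 1024 * 64 }; //large array to force main memory accesses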
void normal_copy(const fftw_complex* in, fftw_complex* copyto, size_t copynum)
{
    for (size_t i = 0; i < copynum; i++)
    {
        copyto[i][0] = in[i][0] / 128.0;
        copyto[i][1] = in[i][1] / 128.0;
    }
}
#if defined(_WIN32) || defined(_WIN64)
void avx2_copy(const fftw_complex* __restrict in, fftw_complex* __restrict copyto, size_t copynum)
#else
void avx2_copy(const fftw_complex* __restrict__ in, fftw_complex* __restrict__ copyto, size_t copynum)
#endif
{   //avx2 supports 256 bit vectorized instructions
    constexpr double zero = 0.0;
    constexpr double dnum = 128.0;
    __m256d tmp = _mm256_broadcast_sd(&zero);
    __m256d div = _mm256_broadcast_sd(&dnum);
    for (size_t i = 0; i < copynum; i += 2)
    {
        tmp = _mm256_load_pd(&in[i][0]);
        tmp = _mm256_div_pd(tmp, div);
        _mm256_store_pd(&copyto[i][0], tmp);
    }
}
int main()
{
    fftw_complex* invec   = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec1 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec3 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    //some initialization stuff for invec
    //some timing stuff (wall clock)
    normal_copy(invec, outvec1, VECTORSIZE);
    //some timing stuff (wall clock)
    avx2_copy(invec, outvec3, VECTORSIZE);
    return 0;
}
fftw_complex is a datatype equivalent to std::complex<double>. I've tested using both g++ (with -O3 and -ftree-vectorize) on Linux and Visual Studio on Windows, with the same results: the AVX2 copy-and-divide is slower, and it segfaults for certain array sizes. The tested array sizes are always powers of 2, so reading past the end of the arrays (from _mm256_load_pd) doesn't seem to be the issue. Any thoughts?
 
    