I wrote a matrix computation C++ library 20 years ago and I would like to boost its performance using the Intel MKL library. For complex-valued vectors/matrices, my library uses two split arrays: one for the real part and one for the imaginary part.
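To make the storage concrete, here is a simplified sketch of the layout (illustrative names only, not the actual class); the pr/pi accessors used in the code below return pointers into these two planes, with 1-based indexing like Matlab and columns stored contiguously:

// Illustrative sketch of the split-complex storage, not the real class
class SplitMatrix
{
    int rows, cols;
    double* real_part;   // rows*cols doubles, column-major
    double* imag_part;   // rows*cols doubles, column-major (unused for real data)
public:
    // pointer to element (r,c) of the real plane, 1-based like Matlab
    double* pr(int r, int c) { return real_part + (c - 1) * rows + (r - 1); }
    // pointer to element (r,c) of the imaginary plane
    double* pi(int r, int c) { return imag_part + (c - 1) * rows + (r - 1); }
};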
Here are the timing results comparing MKL's FFTW interface against my existing fftpack implementation:
- N=65536, fftw time = 0.005(s), fftpack time = 0.001(s)
- N=100000, fftw time = 0.005(s), fftpack time = 0.003(s)
- N=131072, fftw time = 0.006(s), fftpack time = 0.004(s)
- N=250000, fftw time = 0.013(s), fftpack time = 0.007(s)
- N=262144, fftw time = 0.012(s), fftpack time = 0.008(s)
- N=524288, fftw time = 0.022(s), fftpack time = 0.018(s)
- N=750000, fftw time = 0.037(s), fftpack time = 0.025(s)
- N=1048576, fftw time = 0.063(s), fftpack time = 0.059(s)
- N=1500000, fftw time = 0.114(s), fftpack time = 0.079(s)
- N=2097152, fftw time = 0.126(s), fftpack time = 0.146(s)
- N=4194304, fftw time = 0.241(s), fftpack time = 0.35(s)
- N=8388608, fftw time = 0.433(s), fftpack time = 0.788(s)
 
For vectors of up to about 1,500,000 double values, fftpack is faster than fftw; fftw only pulls ahead at the largest sizes.
Here is the code I use:
Matrix X=randn(M,1); //input vector
//start timer
Matrix Y = MyFFTW(X);
// measure time
//function to compute the FFT
Matrix MyFFTW(Matrix X)
{
    int M= X.rows();
    int N= X.cols();
    Matrix Y(T_COMPLEX,M,N); // output complex to store FFT results
    // Input data could also be a matrix (N columns of length M)
    double* in_data = (double*)fftw_malloc(sizeof(double) * M );
    fftw_complex* out_data = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * (M / 2 + 1));
    fftw_plan fftplan = fftw_plan_dft_r2c_1d(M, in_data, out_data, FFTW_ESTIMATE);
    //one fftplan is used for all the matrix columns
    for (int i = 1; i <= N; i++)
    {
        //copy column number i to in_data used by the fftplan; array indexing is 1-based like Matlab
        memcpy(in_data, X.pr(1,i), M* sizeof(double));
        fftw_execute(fftplan);
        //split out_data to real and imag parts
        double* pr = Y.pr(1,i), * pi = Y.pi(1,i);
        int k = (M - 1) / 2, j;
        //bins 0..k are the unique half returned by the r2c transform
        for (j = 0; j <= k; j++)
        {
            *pr++ = out_data[j][0];
            *pi++ = out_data[j][1];
        }
        //the Nyquist bin exists only for even M
        if (M % 2 == 0)
        {
            *pr++ = out_data[M/2][0];
            *pi++ = out_data[M/2][1];
        }
        //mirror the remaining bins using conjugate symmetry: Y[M-j] = conj(Y[j])
        for (j = k; j >= 1; j--)
        {
            *pr++ = out_data[j][0];
            *pi++ = -out_data[j][1];
        }
    }
    fftw_destroy_plan(fftplan);
    fftw_free(in_data);
    fftw_free(out_data);
    return Y;
}
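For reference, the timings above are taken around the MyFFTW call roughly like this (a sketch using std::chrono; Matrix, randn and MyFFTW are from my library, and the print format is just illustrative):

#include <chrono>
#include <cstdio>

void benchmark(int M)
{
    Matrix X = randn(M, 1);     // real input column
    auto t0 = std::chrono::high_resolution_clock::now();
    Matrix Y = MyFFTW(X);       // the timed call
    auto t1 = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("N=%d, fftw time = %.3f(s)\n", M, seconds);
}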
Results are obtained on an Intel Core i7 @ 3.2 GHz, compiled with Visual Studio 2019 against the latest Intel MKL library. Compiler flags are:
/fp:fast /DWIN32 /O2 /Ot /Oi /Oy /arch:AVX2 /openmp /MD 
Linker libs are:
mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
Is there a better way to make fftw faster for vectors of small size?
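One direction I have been considering, but have not tested, is to drop the interleaved out_data buffer and let fftw write straight into my split arrays through the guru split interface (fftw_plan_guru_split_dft_r2c, standard FFTW3 API). I do not know whether the MKL FFTW3 wrapper accepts such plans, so the sketch below, meant to replace part of the column loop in MyFFTW, is only an assumption to verify:

// Sketch: plan one column directly into the split real/imag output arrays,
// replacing out_data and the first copy loop; the conjugate mirror for the
// bins above M/2 is still needed afterwards.
fftw_iodim dim;
dim.n  = M;   // transform length
dim.is = 1;   // input stride: the input column is contiguous
dim.os = 1;   // output stride: write bins 0..M/2 contiguously
fftw_plan plan = fftw_plan_guru_split_dft_r2c(
    1, &dim,          // one transform dimension
    0, nullptr,       // no batching in this sketch
    in_data,          // real input, M doubles
    Y.pr(1, i),       // real parts of bins 0..M/2
    Y.pi(1, i),       // imaginary parts of bins 0..M/2
    FFTW_ESTIMATE);
if (plan)             // the MKL wrapper may return NULL for guru split plans
{
    fftw_execute(plan);
    fftw_destroy_plan(plan);
}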
Update:
I tested against Matlab, which uses MKL fftw for its fft computation:
- N=65536, matlab fft time = 0.071233(s)
- N=100000, matlab fft time = 0.011437(s)
- N=131072, matlab fft time = 0.0074411(s)
- N=250000, matlab fft time = 0.015349(s)
- N=262144, matlab fft time = 0.0082545(s)
- N=524288, matlab fft time = 0.011395(s)
- N=750000, matlab fft time = 0.022364(s)
- N=1048576, matlab fft time = 0.019683(s)
- N=1500000, matlab fft time = 0.033493(s)
- N=2097152, matlab fft time = 0.035345(s)
- N=4194304, matlab fft time = 0.069539(s)
- N=8388608, matlab fft time = 0.1387(s)
 
Except for the first call to fft with N=65536, Matlab (64-bit) is faster than my function (Win32), both when it uses fftpack (for N > 500000) and when it uses MKL fftw.
Thanks