OpenMP: why is my parallel code is SLOWER than serial code? using FFTW

Question

I have a code that uses FFT heavily, and a quick profiling shows indeed that the computer spends a significant portion of wall-clock time in my FFT functions. I have a simple attempt of using openMP here, however the code shows no speed up really compared to the series code. I am providing a simple example code here:

Main.cpp

static const int nx = 128;
static const int ny = 128; 
static const int nz = 128;

Eigen::initParallel();
Eigen::setNbThreads(5); 

Eigen::Tensor<double, 3> Re(nx,ny,nz); 
for(int i = 0; i< nx; i++){
        for(int j = 0; j< ny; j++){
            for(int k = 0; k< nz; k++){ 
                Re(k,i,j) = //Initialized w/ some function
            }
      }
}
Eigen::Tensor<std::complex<double>, 3> Im(nx,ny,nz); 

double start_time, run_time;
start_time = omp_get_wtime();
r2cfft3d(Re, Im); //function take real input and take forward fft to output complex result
run_time = omp_get_wtime() - start_time;
cout << "Time in s: " <<  run_time << "s\n";

The function looks like:

void r2cfft3d(Eigen::Tensor<double, 3>& rArr, Eigen::Tensor<std::complex<double>, 3>& cArr){    

    #pragma omp parallel 
    {
    fftw_complex *input_array;
    fftw_complex *output_array;

    input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
    output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

    #pragma omp parallel for
    for (int i = 0; i < nx; ++i) {

        for (int j = 0; j < ny; ++j) {

            for (int k = 0; k < nz; ++k) {
                {
                    input_array[k + nz * (j + ny * i)][REAL] = rArr(k,i,j);
                    input_array[k + nz * (j + ny * i)][IMAG] = 0;
                }
            }
        }
    }

    #pragma omp critical (make_plan)
    {
        fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(forward);
        fftw_destroy_plan(forward);
        fftw_cleanup();
    }

    #pragma omp for nowait
     for(int i=0; i < nx; ++i){

        for(int j=0; j < ny; ++j) {

            for(int k=0; k < nz; ++k) {
                cArr(k,i,j).real(output_array[k + nz * (j + ny * i)][REAL]);
                cArr(k,i,j).imag(output_array[k + nz * (j + ny * i)][IMAG]);
                
            }

        }
    }
    
    fftw_free(input_array);
    fftw_free(output_array);

    }


}

The serial code returns an wall time of : Time in s: 0.026634s while the parallel code returns Time in s: 0.287648s

I am fairly new to OpenMP, and thought this would be more straightforward. Is there a better way to rewrite this function? Thanks!

EDIT: So, this is my second attempt, where the runtime has gotten better but not SIGNICANTLY better

void r2cfft3d(Eigen::Tensor<double, 3>& rArr, Eigen::Tensor<std::complex<double>, 3>& cArr){    

    #pragma omp parallel 
    {
    fftw_complex *input_array;
    fftw_complex *output_array;

    input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
    output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

    
    #pragma omp for
    for (int i = 0; i < nx; ++i) {

        for (int j = 0; j < ny; ++j) {

            for (int k = 0; k < nz; ++k) {
                {
                    input_array[k + nz * (j + ny * i)][REAL] = rArr(k,i,j);
                    input_array[k + nz * (j + ny * i)][IMAG] = 0;
                }
            }
        }
    }

    #pragma omp single //#pragma omp critical (make_plan)
    {
        fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(forward);
        fftw_destroy_plan(forward);
        fftw_cleanup();
    }
    


    #pragma omp for nowait
     for(int i=0; i < nx; ++i){

        for(int j=0; j < ny; ++j) {

            for(int k=0; k < nz; ++k) {
                cArr(k,i,j).real(output_array[k + nz * (j + ny * i)][REAL]);
                cArr(k,i,j).imag(output_array[k + nz * (j + ny * i)][IMAG]);
                
            }

        }
    }

    
    fftw_free(input_array);
    fftw_free(output_array);
    
    }
  
    

}

As Jérôme Richard noted that the use nowait here might be a bad idea. I notice it reduces the runtime by few seconds which may has to do with the freeing of memory. Thanks all!

Edit 2: I rewrote the whole function and trying to make things more streamlined and avoid use of for loops. Now, for the serial code both of the attempts return same/similar serial time: Serial Time in SEC: 0.039189s However, by adding openmp parallelization attemp the time again is SLOWED down: Parallel Time in SEC: 0.0934717s

#pragma omp parallel
    {   

        fftw_complex *input_array;
        input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
        memcpy(input_array, rArr.data(), nx*ny*nz * sizeof(fftw_complex));

        fftw_complex *output_array;
        output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));
        #pragma omp single
        {
            fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
            fftw_execute(forward);
            fftw_destroy_plan(forward);
            fftw_cleanup();
        }
        
        memcpy(cArr.data(),output_array, nx*ny*nz * sizeof(fftw_complex));
        
        
        
        fftw_free(input_array);
        fftw_free(output_array);
    }
}

Edit 3: I finally used fftw parallel combined with OpenMP following the instructions from: https://www.fftw.org/fftw3_doc/Usage-of-Multi_002dthreaded-FFTW.html

Running with the flags: -lfftw3 -lfftw3_threads -fopenmp

Example

fftw_complex *input_array;
        input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));


        
        memcpy(input_array, rArr.data(), nx*ny*nz * sizeof(fftw_complex));

        fftw_complex *output_array;
        output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));

        fftw_init_threads();
        fftw_plan_with_nthreads(omp_get_max_threads());
        
        fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
            

fftw_execute(forward);
fftw_destroy_plan(forward);
fftw_cleanup();

        
memcpy(cArr.data(),output_array, nx*ny*nz * sizeof(fftw_complex));
        
        
        
fftw_free(input_array);
fftw_free(output_array);

This returns a Parallel Time in SEC: 0.0281432s But now I am not sure if this is correct cause it seems too straightforward?

`input_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));` `output_array = (fftw_complex*) fftw_malloc(nx*ny*nz * sizeof(fftw_complex));` -- Do you really want to include the time of how long it takes to allocate (and deallocate) memory? What if you precreate these buffers, and pass them to the `r2cfft3d` function? — PaulMcKenzie, Feb 26 '23 at 05:27
@PaulMcKenzie I see, I mean I could try doing that. It's just that, I thought this looked more clean or less messy. Also, would this bit actually contribute to slowing down the parallel code compared to the serial? — Jamie, Feb 26 '23 at 06:12
Creating plans is typically more expensive than the FFT itself (depending on what flags you pass). Ideally create a plan once and use it many times, assuming you’re doing multiple FFTs of the same dimensions etc (otherwise use FFTW “wisdom”). — Paul R, Feb 26 '23 at 07:16
Your code practically runs serially because the main calculation is inside a critical section. However, I do not understand what you would like to parallelize here. Do you want to parallelize only the fill-up of input/output arrays and call `fftw_plan_dft_3` only once? Or do you wish to call the `fftw_plan_dft_3d` function in parallel with different input arrays? At the moment all threads call `fftw_plan_dft_3d` using the same input array, which makes no sense to me. Most probably this is the main reason of long runtime. Depending on your answer, your code will need to be corrected differently — Laci, Feb 26 '23 at 07:54
`#pragma omp parallel` use N threads so the section is executed N times by N different threads. This means you allocate N time more data than needed (which is really critical on 64 core machines). `#pragma omp parallel for` is equivalent to `#pragma omp parallel` + `#pragma omp for`. This means you have 2 nested parallel section. Nesting is clearly inefficient. You should not use it except in very few rare cases. The `nowait` in the last `#pragma omp for` is really bad idea: you basically tell OpenMP that the memory of `output_array` can be freed why it is read. This can cause a *segfault*! — Jérôme Richard, Feb 26 '23 at 14:12
You have malloc's inside a parallel region, which means that each thread gets its own copy of the full array. So if you have 8 cores this probably means you're doing 8 identical FFTs in parallel. — Victor Eijkhout, Feb 26 '23 at 15:44
Note also that FFTW supports parallel operation on multiple cores internally via threads, so there is no real need for omp: http://www.fftw.org/fftw3_doc/Multi_002dthreaded-FFTW.html — Paul R, Feb 26 '23 at 16:40
@Laci Thanks, made few changes that reduced the runtime to ``Time in s: 0.0471485s`` which is still slower. I guess because I am using this function in various several places in my codes including inside of other function. Maybe the best option is to call fftw plan function in parallel with different input arrays? I will add my edited code as well. — Jamie, Feb 26 '23 at 20:42
@JérômeRichard Thanks for your answer, things are definitely making more sense now. I see that ``#pragma omp parallel`` maybe not the best option here so I changed that and it fixed things (slightly better). But, I thought the ``nowait`` was helping here. I will post another updated version of the code as well. — Jamie, Feb 26 '23 at 20:52
@PaulR I saw that, the thing is my entire code uses Eigen and I thought openmp might be the best option here. Plus it's supposed to be easier? Not sure if using the fftw multithreaded option here along with openmp for other parts of the code may cause issues or conflicts. — Jamie, Feb 26 '23 at 20:57
1) You should profile your code an find the bottleneck. If it is FFTW calls, which is - as pointed out by Paul R - is a parallel library, forget any further parallelization. 2) At the moment there is no parallel region in your updated code, so you are not using OpenMP at all. 3) To remove `nowait` is not optional, without it your code is flawed. 4) You should also consider PaulMcKenzie's suggestion. — Laci, Feb 27 '23 at 06:32
As long as you take care to not call FFTW with multiple threads in parallel, using parallel FFTW **and** OpenMP for your own code should be no problem at all. I.e. avoid nested parallelism. Ideally you should use a build of FFTW that uses OMP for parallelism as well (there are multiple different threading options/implementations for FFTW). — paleonix, Feb 28 '23 at 00:24
@paleonix Thanks for the great feedback. So, in an attempt to avoid nested parallelsim. I rewrote my function as shown in my new edit sort of to make it a lot simpler. I used openMP and I am getting some speed up but not so significant. Ideally, I am trying to push for twice speed up. I am trying to test with parallel FFTW as well with openMP but struggling to find an applicable example so far — Jamie, Apr 13 '23 at 21:20
I would hazard a guess that the threads are all accessing the same cache lines, so you'll be running into "false sharing". You would need to partition the input and output arrays such that different threads access different blocks with absolutely minimal overlap. Aligning the arrays to cache-line boundaries (64 bytes typ') will also help. — Den-Jason, Apr 13 '23 at 21:27
@Den-Jason "You would need to partition the input and output arrays such that different threads access different blocks with absolutely minimal overlap" Mm I am not sure how I would approach that. — Jamie, Apr 13 '23 at 21:29
Also take a look at https://cs.wmich.edu/gupta/teaching/cs5260/5260Sp15web/studentProjects/tiba&hussein/03278999.pdf There are some references that might be useful, e.g. "Quinn, M.J., (2004). Parallel programming in C with MPI and OpenMP" — Den-Jason, Apr 13 '23 at 21:32
In terms of understanding false sharing, see threads such as this - https://stackoverflow.com/questions/8331255/false-sharing-and-pthreads — Den-Jason, Apr 13 '23 at 21:34
Also Chandler Carruth (or Matt Godbolt) did a CppCon talk where he demonstrated that for small datasets, bubble sort is the quickest algorithm - due to the absence of pointer chasing. e.g. https://www.youtube.com/watch?v=fHNmRkzxHWs — Den-Jason, Apr 13 '23 at 21:40
Did end up using an FFTW combined with openMP example which seems to speed up the code, but I am not sure how — Jamie, Apr 14 '23 at 02:35

OpenMP: why is my parallel code is SLOWER than serial code? using FFTW

0 Answers0