I have what seems to be a very simple parallel for loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets. I thought this was due to some cache thrashing, so I played around with schedules, chunk size, __restrict__, nesting the parallel for inside a parallel block, and flushes. Then I noticed that reading the array to do a reduction is also slower.
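For example, one of the scheduling variants I tried looked roughly like this (the chunk size was picked arbitrarily; ptr and sz_i are the same raw pointer and int-sized count as in the full code below):

#pragma omp parallel for schedule(static, 4096)
for(int i = 0; i < sz_i; ++ i)
    ptr[i] = 0;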
This should obviously be very simple and should speed up nearly linearly. What am I missing here?
Full code:
#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>
void tic(), toc();
int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);
    std::cout << "max threads: " << omp_get_max_threads() << std::endl;

    std::cout << "serial reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    toc();

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();

    std::cout << "serial memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }
    toc();

    std::cout << "parallel memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    toc();

    return 0;
}

static clock_t ClockCounter;

void tic()
{
    ClockCounter = std::clock();
}

void toc()
{
    ClockCounter = std::clock() - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}
Running this yields:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 1790000
parallel reduction
elapsed clock ticks: 19690000
serial memset
elapsed clock ticks: 3860000
parallel memset
elapsed clock ticks: 20800000
If I run with -O2, g++ is able to optimize the serial reduction away and I get zero time, hence -O1 (a sketch of one way around this is shown after the output below). Also, calling omp_set_num_threads(1); makes the times more similar, although there are still some differences:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 1
serial reduction
elapsed clock ticks: 1770000
parallel reduction
elapsed clock ticks: 7370000
serial memset
elapsed clock ticks: 2290000
parallel memset
elapsed clock ticks: 3550000
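As an aside, one way to keep -O2 from eliminating the serial reduction would be to actually consume the result. This is only a sketch of a drop-in replacement for the serial reduction block in main above, not what the timings here were measured with:

double total = 0;
for(int c = 0; c < COUNT; ++ c) {
    double sum = 0;
    for(size_t i = 0; i < sz; ++ i)
        sum += vec[i];
    total += sum; // keep the value live across repetitions
}
std::cout << "total: " << total << std::endl; // printing the result prevents dead-code elimination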
This should be fairly obvious; I feel like I'm missing something very elementary. My CPU is an Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz with hyperthreading, but the same behavior is observed on a colleague's i5 with 4 cores and no hyperthreading. We're both running Linux.
EDIT
It seems that one error was in the timing: std::clock() measures CPU time summed over all threads of the process, whereas omp_get_wtime() measures wall-clock time (a small standalone illustration follows at the end of this edit). Running with:
static double ClockCounter;

void tic()
{
    ClockCounter = omp_get_wtime(); //std::clock();
}

void toc()
{
    ClockCounter = omp_get_wtime()/*std::clock()*/ - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}
yields more "reasonable" times:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 1.80974
parallel reduction
elapsed clock ticks: 2.07367
serial memset
elapsed clock ticks: 2.37713
parallel memset
elapsed clock ticks: 2.23609
But still, there is no speedup; it is just no longer slower.
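To illustrate the timing pitfall in isolation, here is a separate toy program (not part of the benchmark above): since std::clock() accumulates CPU time over all threads, with n busy threads it reports roughly n times the wall-clock time that omp_get_wtime() measures.

#include <omp.h>
#include <cmath>
#include <ctime>
#include <iostream>

int main()
{
    const clock_t c0 = std::clock();
    const double w0 = omp_get_wtime();
    double sink = 0;
    #pragma omp parallel reduction(+:sink)
    for(int i = 0; i < 50000000; ++ i)
        sink += std::sqrt(double(i)); // CPU-bound busy work executed by every thread
    std::cout << "cpu time: " << double(std::clock() - c0) / CLOCKS_PER_SEC
        << " s, wall time: " << omp_get_wtime() - w0
        << " s, sink: " << sink << std::endl; // cpu time is roughly threads * wall time
    return 0;
}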
EDIT2:
As suggested by user8046, the code is heavily memory bound, and as suggested by Z boson, the serial code is easily optimized away, so it is not certain what is being measured here. So I made a small change: I moved the sum outside of the loop over c, so that it does not get zeroed at every iteration, and I replaced the reduction operation with sum += F(vec[i]) and the memset operation with vec[i] = F(i) (a sketch of the modified loops is at the end of this edit). Running as:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 23.9106
parallel reduction
elapsed clock ticks: 3.35519
serial memset
elapsed clock ticks: 43.7344
parallel memset
elapsed clock ticks: 6.50351
Calculating the square root adds more work to the threads and there is finally some reasonable speedup (it is about 7x, which makes sense, as the hyperthreaded cores share memory lanes).
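For reference, the modified loops look roughly like this (F comes from the -D define on the command line, and sum is hoisted out of the c loop so the compiler cannot drop it). A back-of-the-envelope check also supports the memory-bound explanation for the original version: the array is 50,000,000 ints, i.e. 200 MB, so 100 repetitions stream about 20 GB, and 20 GB in roughly 1.8 s is on the order of 11 GB/s, already a sizeable fraction of what the memory system can sustain.

#include <cmath> // for sqrt used by the F macro

double sum = 0; // hoisted outside the repetition loop so the result stays live
tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for default(none) reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += F(ptr[i]); // extra arithmetic per element gives the threads real work
}
toc();

tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for default(none)
    for(int i = 0; i < sz_i; ++ i)
        ptr[i] = F(i); // implicit double-to-int conversion, same idea for the "memset" loop
}
toc();
std::cout << "sum: " << sum << std::endl; // use the result so it is not optimized away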