I've run across something odd. I am testing an MPI + OMP parallel code on a small local machine with a single, humble 4-core i3. One of my loops, it turns out, is very slow with more than 1 OMP thread per MPI process in this environment (i.e., more total threads than cores).
#pragma omp parallel for
for ( int i = 0; i < HEIGHT; ++i )
{
    for ( int j = 0; j < WIDTH; ++j )
    {
        // Normalize the sample to [0, 1] and pack it into an 8-bit buffer.
        double a =
            ( data[ sIdx * S_SZ + j + i * WIDTH ] - dMin ) / ( dMax - dMin );
        buff[ i ][ j ] = ( unsigned char ) ( 255.0 * a );
    }
}
If I run this code with the defaults (without setting OMP_NUM_THREADS or calling omp_set_num_threads), it takes about 1 s. However, if I explicitly set the number of threads to 1 by either method (export OMP_NUM_THREADS=1 or omp_set_num_threads(1)), it takes about 0.005 s (200X faster).
But it seems that omp_get_num_threads() returns 1 regardless. In fact, if I just call omp_set_num_threads( omp_get_num_threads() ), the loop takes about 0.005 s, whereas with that line commented out it takes about 1 s.
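To be concrete, this is the kind of check I mean (a small standalone sketch I wrote just for this question, compiled with -fopenmp; the omp_get_max_threads() call and the in-parallel print are extra diagnostics, not part of the real code):

#include <omp.h>
#include <cstdio>

int main()
{
    // In the sequential part of the program the current team has exactly one
    // thread, so this always prints 1 -- this is the call I quoted above.
    printf( "outside parallel: omp_get_num_threads() = %d\n",
            omp_get_num_threads() );

    // Upper bound the runtime would use for the next parallel region.
    printf( "omp_get_max_threads() = %d\n", omp_get_max_threads() );

    // Inside a parallel region the same call reports the actual team size.
    #pragma omp parallel
    {
        #pragma omp single
        printf( "inside parallel: omp_get_num_threads() = %d\n",
                omp_get_num_threads() );
    }
    return 0;
}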
Any idea what is going on here? Why should calling omp_set_num_threads( omp_get_num_threads() ) once at the beginning of a program ever result in a 200X difference in performance?
Some context:
cpu: Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
g++ --version: g++ (GCC) 10.2.0
compiler flags: mpic++ -std=c++11 -O3 -fpic -fopenmp ...
running program: mpirun -np 4 ./a.out
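For completeness, here is a stripped-down, self-contained version of what I am running (all sizes and values below are placeholders I made up so it compiles on its own; the real WIDTH, HEIGHT, S_SZ, sIdx, dMin, dMax and data come from the application):

#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

static const int WIDTH  = 1920;   // placeholder
static const int HEIGHT = 1080;   // placeholder
static const int S_SZ   = WIDTH * HEIGHT;

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );
    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    // Uncommenting this single line is what makes the loop ~200X faster for me.
    // omp_set_num_threads( omp_get_num_threads() );

    std::vector< double > data( S_SZ, 0.5 );              // placeholder data
    std::vector< std::vector< unsigned char > > buff(
        HEIGHT, std::vector< unsigned char >( WIDTH ) );
    const int    sIdx = 0;                                 // placeholder
    const double dMin = 0.0, dMax = 1.0;                   // placeholders

    double t0 = omp_get_wtime();

    #pragma omp parallel for
    for ( int i = 0; i < HEIGHT; ++i )
    {
        for ( int j = 0; j < WIDTH; ++j )
        {
            double a =
                ( data[ sIdx * S_SZ + j + i * WIDTH ] - dMin ) / ( dMax - dMin );
            buff[ i ][ j ] = ( unsigned char ) ( 255.0 * a );
        }
    }

    double t1 = omp_get_wtime();
    printf( "rank %d: loop took %f s\n", rank, t1 - t0 );

    MPI_Finalize();
    return 0;
}

This is built with the same mpic++ flags listed above and launched with mpirun -np 4 ./a.out.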