I am developing large dense matrix multiplication code. When I profile it, it sometimes reaches about 75% of the peak flops of my four-core system and other times only about 36%. The efficiency is stable within a single execution: a run either starts at 75% and stays there, or starts at 36% and stays there.
I have traced the problem down to hyper-threading and the fact that I set the number of threads to four instead of the default eight. When I disable hyper-threading in the BIOS I get about 75% efficiency consistently (or at least I never see the drastic drop to 36%).
Before I call any parallel code I do omp_set_num_threads(4). I have also tried export OMP_NUM_THREADS=4 before running my code, which appears to be equivalent.
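For reference, my setup looks roughly like the following minimal sketch (compiled with gcc -fopenmp; the actual gemm kernel is omitted and the names are illustrative):

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);  /* request four threads before any parallel region */
    #pragma omp parallel
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());
        /* ... the multiplication kernel runs here ... */
    }
    return 0;
}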
I don't want to disable hyper-threading in the BIOS. I think I need to bind the four threads to the four physical cores. I have tested a few different settings of GOMP_CPU_AFFINITY, but so far I still sometimes see the 36% efficiency. What is the mapping between hyper-threads and cores? E.g., do logical CPUs 0 and 1 correspond to the same physical core, and CPUs 2 and 3 to another core?
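One way to check the mapping directly (a sketch assuming the standard Linux sysfs topology files) is to read core_id for each logical CPU:

#include <stdio.h>

int main(void) {
    /* print which physical core each logical CPU belongs to (Linux-specific) */
    for (int cpu = 0; cpu < 8; cpu++) {
        char path[64];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;               /* fewer logical CPUs than expected */
        int core_id;
        if (fscanf(f, "%d", &core_id) == 1)
            printf("logical CPU %d -> physical core %d\n", cpu, core_id);
        fclose(f);
    }
    return 0;
}

On most systems lscpu -e or /proc/cpuinfo reports the same information.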
How can I bind the four threads to four distinct cores, without thread migration, so that I don't have to disable hyper-threading in the BIOS? Maybe I need to look into using sched_setaffinity?
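If the environment variables keep failing, here is a sketch of pinning each thread from inside the program with sched_setaffinity (it assumes logical CPUs 0-3 sit on distinct physical cores, which is exactly the mapping I am unsure about):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);          /* one logical CPU per thread */
        if (sched_setaffinity(0, sizeof set, &set))   /* pid 0 = calling thread */
            perror("sched_setaffinity");
        printf("thread %d now on CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}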
Some details of my current system: Linux kernel 3.13, GCC 4.8, Intel Xeon E5-1620 (four physical cores, eight hyper-threads).
Edit: This seems to be working well so far
export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"
or
export GOMP_CPU_AFFINITY="0-7"
Edit: This also seems to work well
export OMP_PROC_BIND=true
Edit: These options also work well (gemm is the name of my executable)
numactl -C 0,1,2,3 ./gemm
and
taskset -c 0,1,2,3 ./gemm
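To double-check that any of these approaches actually restricts where the threads may run, a small sketch that reads back each thread's allowed-CPU mask (it assumes the eight logical CPUs of this box):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        cpu_set_t set;
        sched_getaffinity(0, sizeof set, &set);  /* pid 0 = calling thread */
        #pragma omp critical
        {
            printf("thread %d may run on CPUs:", omp_get_thread_num());
            for (int cpu = 0; cpu < 8; cpu++)    /* eight hyper-threads here */
                if (CPU_ISSET(cpu, &set))
                    printf(" %d", cpu);
            printf("\n");
        }
    }
    return 0;
}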