I am developing large dense matrix multiplication code. When I profile it, it sometimes reaches about 75% of the peak flops of my four-core system and other times only about 36%. The efficiency is stable within a single execution: a run either starts at 75% and stays there, or starts at 36% and stays there.
I have traced the problem down to hyper-threading and the fact that I set the number of threads to four instead of the default eight. When I disable hyper-threading in the BIOS I get about 75% efficiency consistently (or at least I never see the drastic drop to 36%).
Before I call any parallel code I do omp_set_num_threads(4). I have also tried export OMP_NUM_THREADS=4 before running my code, which appears to be equivalent.
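For reference, my setup looks roughly like the following minimal sketch (compiled with gcc -fopenmp; the actual gemm kernel is omitted and the names are illustrative):

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);  /* request four threads before any parallel region */
    #pragma omp parallel
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());
        /* ... the multiplication kernel runs here ... */
    }
    return 0;
}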
I don't want to disable hyper-threading in the BIOS. I think I need to bind the four threads to the four physical cores. I have tested a few different settings of GOMP_CPU_AFFINITY, but so far I still sometimes see the 36% efficiency. What is the mapping between hyper-threads and cores? E.g., do logical CPUs 0 and 1 correspond to the same physical core, and CPUs 2 and 3 to another core?
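One way to check the mapping directly (a sketch assuming the standard Linux sysfs topology files) is to read core_id for each logical CPU:

#include <stdio.h>

int main(void) {
    /* print which physical core each logical CPU belongs to (Linux-specific) */
    for (int cpu = 0; cpu < 8; cpu++) {
        char path[64];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;               /* fewer logical CPUs than expected */
        int core_id;
        if (fscanf(f, "%d", &core_id) == 1)
            printf("logical CPU %d -> physical core %d\n", cpu, core_id);
        fclose(f);
    }
    return 0;
}

On most systems lscpu -e or /proc/cpuinfo reports the same information.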
How can I bind the four threads to four distinct cores, without thread migration, so that I don't have to disable hyper-threading in the BIOS? Maybe I need to look into using sched_setaffinity?
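If the environment variables keep failing, here is a sketch of pinning each thread from inside the program with sched_setaffinity (it assumes logical CPUs 0-3 sit on distinct physical cores, which is exactly the mapping I am unsure about):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);          /* one logical CPU per thread */
        if (sched_setaffinity(0, sizeof set, &set))   /* pid 0 = calling thread */
            perror("sched_setaffinity");
        printf("thread %d now on CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}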
Some details of my current system: Linux kernel 3.13, GCC 4.8, Intel Xeon E5-1620 (four physical cores, eight hyper-threads).
Edit: This seems to be working well so far
export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"
or
export GOMP_CPU_AFFINITY="0-7"
Edit: This also seems to work well
export OMP_PROC_BIND=true
Edit: These options also work well (gemm is the name of my executable)
numactl -C 0,1,2,3 ./gemm
and
taskset -c 0,1,2,3 ./gemm
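To double-check that any of these approaches actually restricts where the threads may run, a small sketch that reads back each thread's allowed-CPU mask (it assumes the eight logical CPUs of this box):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        cpu_set_t set;
        sched_getaffinity(0, sizeof set, &set);  /* pid 0 = calling thread */
        #pragma omp critical
        {
            printf("thread %d may run on CPUs:", omp_get_thread_num());
            for (int cpu = 0; cpu < 8; cpu++)    /* eight hyper-threads here */
                if (CPU_ISSET(cpu, &set))
                    printf(" %d", cpu);
            printf("\n");
        }
    }
    return 0;
}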