My problem in short:
I have a computer with two sockets of AMD Opteron 6272 and 64 GB of RAM.
I run one multithreaded program on all 32 cores and get about 15% less speed than when I run two copies of the program, each on one 16-core socket.
How do I make the one-program version as fast as the two-program version?
More details:
I have a large number of tasks and want to fully load all 32 cores of the system.
So I pack the tasks into groups of 1000. Such a group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test ideal I copy these groups 32 times and distribute the tasks across the 32 cores using Intel TBB's parallel_for loop.
I use pthread_setaffinity_np to ensure that the system does not let my threads migrate between cores, and that consecutive cores are all used.
I use mlockall(MCL_FUTURE) to ensure that the system does not migrate my memory between sockets.
So the code looks like this:
  void operator()(const blocked_range<size_t> &range) const
  {
    for (unsigned int i = range.begin(); i != range.end(); ++i) {
      // Pin this thread to its assigned core.
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
      mlockall(MCL_FUTURE); // lock virtual memory at the physical addresses where it was allocated
      // Run this thread's batch of tasks.
      TaskManager manager;
      for (int j = 0; j < fNTasksPerThr; j++) {
        manager.SetData( &(InpData->fInput[j]) );
        manager.Run();
      }
    }
  }
Only the computing time matters to me, so I prepare the input data in a separate parallel_for loop and do not include the preparation time in the measurements.
  void operator()(const blocked_range<size_t> &range) const
  {
    for (unsigned int i = range.begin(); i != range.end(); ++i) {
      // Pin this thread to its core before allocating, so the copies
      // below are made (and placed) by the thread that will use them.
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
      mlockall(MCL_FUTURE); // lock virtual memory at the physical addresses where it was allocated
      // Give each thread its own copy of the input data.
      InpData[i].fInput = new ProgramInputData[fNTasksPerThr];
      for (int j = 0; j < fNTasksPerThr; j++) {
        InpData[i].fInput[j] = InpDataPerThread.fInput[j];
      }
    }
  }
Now I run all this on 32 cores and see a throughput of ~1600 tasks per second.
Then I create two copies of the program and, with taskset and pthread_setaffinity_np, ensure that the first runs on the 16 cores of the first socket and the second on the second socket. I launch them side by side simply using & in the shell:
program1 & program2 &
Each of these programs achieves a speed of ~900 tasks/s. In total that is >1800 tasks/s, which is 15% more than the one-program version.
What am I missing?
I suspect the problem may be the libraries, which are loaded into the memory of the master thread only. Could this be the problem? Can I copy the library data so that it is available independently on both sockets?