C++ multithreading with OpenMP: poor performance despite localized variables (false sharing?)

Question

I have run into a fairly weird OpenMP problem.

The task is to take a vector of strings and split each element into its contained k-mers (all contained substrings of length k). This should parallelize trivially along the elements of the vector, as the k-mer extraction happens independently for each element. I want to store the results in a nested map STL data structure (std::map<long long, std::map<std::string, std::map<unsigned int, int> > > local_forReturn), and I allocate a thread-local variable for that.

The achieved parallelization behaviour, however, is surprisingly bad: top on Linux shows CPU usage of ~200%, despite running with 40 threads on a 40-core machine. (And I have tested that the #pragma omp critical section is not the bottleneck.)

My hunch is that this might be related to false sharing, as the actual data contained in my localized map/set STL classes will end up on the heap. However, I have no idea how to test my intuition, nor how to reduce false sharing for STL constructs (if this is the problem). I would greatly appreciate any ideas!

Complete code:

#include <string>
#include <cassert>
#include <cstdlib> // rand()
#include <set>
#include <map>
#include <vector>
#include <omp.h>
#include <iostream>

int threads = 40;
int k = 31;

std::string generateRandomSequence(int length);
char randomNucleotide();
std::vector<std::string> partitionStringIntokMers(std::string str, int k);

int main(int argc, char *argv[])
{
    // generate test data
    std::vector<std::string> requiredSEQ;
    for(unsigned int i = 0; i < 10000; i++)
    {
        std::string seq = generateRandomSequence(20000);
        requiredSEQ.push_back(seq);
    }

    // this variable will contain the final result
    std::map<long long, std::map<std::string, std::map<unsigned int, int> > > forReturn;

    omp_set_num_threads(threads);

    std::cerr << "Data generated, now start parallel processing\n" << std::flush;

    // split workload (ie requiredSEQ) according to number of threads
    long long max_i = requiredSEQ.size() - 1;
    long long chunk_size = max_i / threads;
    #pragma omp parallel
    {
        assert(omp_get_num_threads() == threads);
        long long thisThread = omp_get_thread_num();
        long long firstPair = thisThread * chunk_size;
        long long lastPair = (thisThread+1) * chunk_size - 1;
        if((thisThread == (threads-1)) && (lastPair < max_i))
        {
            lastPair = max_i;
        }

        std::map<long long, std::map<std::string, std::map<unsigned int, int> > > local_forReturn;

        for(long long seqI = firstPair; seqI <= lastPair; seqI++)
        {
            const std::string& SEQ_sequence = requiredSEQ.at(seqI);

            const std::vector<std::string> kMersInSegment = partitionStringIntokMers(SEQ_sequence, k);
            for(unsigned int kMerI = 0; kMerI < kMersInSegment.size(); kMerI++)
            {
                const std::string& kMerSeq = kMersInSegment.at(kMerI);
                local_forReturn[seqI][kMerSeq][kMerI]++;
            }   
        }

        #pragma omp critical
        {
            forReturn.insert(local_forReturn.begin(), local_forReturn.end());
        }
    }

    return 0;   
}

std::string generateRandomSequence(int length)
{
    std::string forReturn;
    forReturn.resize(length);
    for(int i = 0; i < length; i++)
    {
        forReturn.at(i) = randomNucleotide();
    }
    return forReturn;
}

char randomNucleotide()
{
    char nucleotides[4] = {'A', 'C', 'G', 'T'};
    int n = rand() % 4;
    assert((n >= 0) && (n <= 3));
    return nucleotides[n];
}


std::vector<std::string> partitionStringIntokMers(std::string str, int k)
{
    std::vector<std::string> forReturn;
    if((int)str.length() >= k)
    {
        forReturn.resize((str.length() - k)+1); 
        for(int i = 0; i <= (int)(str.length() - k); i++)
        {
            std::string kMer = str.substr(i, k);
            assert((int)kMer.length() == k);
            forReturn.at(i) = kMer;
        }
    }
    return forReturn;
}

You are doing a lot of dynamic memory allocation behind the scenes in partitionStringIntokMers(). Passing the large str parameter by value allocates and copies an entire string (make that parameter a const&). More allocation happens in the str.substr() calls and when each k-mer is assigned into the vector, and again when the vector of strings is copied back out as the return value. Then there are the map insertions in main's inner loop. _Is your dynamic memory allocator parallelized?_ (Or is it just wrapped in a critical section/mutex?) — Wandering Logic, Apr 20 '13 at 03:01
Thanks. I have const-ed the variables; this does not make a difference. — Alexander, Apr 20 '13 at 12:55
Regarding the dynamic memory allocation: honestly, I don't know. This is being compiled on g++ 4.7, and I am using the standard allocator. Googling around a bit was not too helpful in trying to find out whether it just uses mutexes. — Alexander, Apr 20 '13 at 12:56
You need to `const &` the variables, not just `const`. The reference is the important part for avoiding the copy. You said in your original post that you had tested that your `#pragma omp critical` section is not the bottleneck. How did you do that? Can you apply the same technique to figuring out if the `new`/`delete` operators are the bottleneck? You might also try the [tbb scalable allocator](http://stackoverflow.com/questions/657783/how-does-intel-tbbs-scalable-allocator-work). — Wandering Logic, Apr 20 '13 at 13:09
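To make the `const &` point concrete, here is a minimal sketch of the partitioning function taking its input by const reference. The name `partitionByConstRef` is hypothetical and not part of the posted code. It removes only the copy of the whole input sequence; it still creates one string per k-mer, so it is not expected to remove allocator pressure on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant of partitionStringIntokMers: the input is taken by
// const reference, so calling it no longer copies the whole sequence.
std::vector<std::string> partitionByConstRef(const std::string& str, int k)
{
    assert(k > 0);
    std::vector<std::string> kMers;
    const std::size_t len = static_cast<std::size_t>(k);
    if (str.length() < len)
    {
        return kMers; // sequence shorter than k: no k-mers
    }
    const std::size_t count = str.length() - len + 1;
    kMers.reserve(count); // size the vector's storage once, up front
    for (std::size_t i = 0; i < count; i++)
    {
        // Each substr still creates a new string (typically a heap
        // allocation for k = 31), so allocation is reduced, not removed.
        kMers.push_back(str.substr(i, len));
    }
    return kMers;
}
```

The declaration near the top of the file and the definition would both need the new signature for the call in `main` to pick it up.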
Also: instead of all that `string` and `vector` alloc and dealloc: I think both your sequential and parallel algorithms will be substantially better off if you just put all your nucleotide sequences in one big vector of char, and then pass around pairs (and references to vectors of pairs) of indexes into the nucleotide vector. — Wandering Logic, Apr 20 '13 at 13:14
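One possible reading of the "one big vector of char plus index pairs" suggestion is sketched below. The names `SequencePool`, `appendSequence` and `kMerOffsets` are hypothetical illustrations, not from the original post. A k-mer is represented by its start offset into the shared buffer, so no per-k-mer string is allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical layout: all sequences concatenated into one buffer, with each
// sequence described by a [begin, end) pair of offsets into that buffer.
struct SequencePool
{
    std::vector<char> nucleotides;
    std::vector<std::pair<std::size_t, std::size_t> > ranges;
};

// Appends one sequence to the pool and records where it lives.
void appendSequence(SequencePool& pool, const std::string& seq)
{
    const std::size_t begin = pool.nucleotides.size();
    pool.nucleotides.insert(pool.nucleotides.end(), seq.begin(), seq.end());
    pool.ranges.push_back(std::make_pair(begin, begin + seq.size()));
}

// Returns the start offsets of all k-mers in sequence seqIndex. A k-mer is
// the k chars starting at a returned offset; no string copies are made.
std::vector<std::size_t> kMerOffsets(const SequencePool& pool,
                                     std::size_t seqIndex, std::size_t k)
{
    std::vector<std::size_t> offsets;
    const std::size_t begin = pool.ranges.at(seqIndex).first;
    const std::size_t end = pool.ranges.at(seqIndex).second;
    if (k == 0 || end - begin < k)
    {
        return offsets; // sequence too short for any k-mer
    }
    offsets.reserve(end - begin - k + 1);
    for (std::size_t i = begin; i + k <= end; i++)
    {
        offsets.push_back(i);
    }
    return offsets;
}
```

Keying a map on such offsets would need a comparator that compares the underlying k characters, which this sketch leaves out.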
You are probably correct in guessing this is a false sharing problem due to the locality of the variables. I'm assuming you already tried running the code with `#pragma omp parallel for`? Also, a bit unrelated, but is this computational genomics/bioinformatics work? — Michael Aquilina, Nov 26 '13 at 21:37
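For reference, a `#pragma omp parallel for` version of the main loop might look like the sketch below. The helper `kMerizeAll` and the typedef `KMerPositions` are hypothetical names introduced here. Each iteration writes only to its own pre-allocated slot, so no critical section is needed, but the per-k-mer allocations remain, so this alone may not improve scaling if the allocator is the bottleneck.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// k-mer string -> (position in sequence -> count), as in the posted code.
typedef std::map<std::string, std::map<unsigned int, int> > KMerPositions;

// Hypothetical helper: one result slot per input sequence, filled in parallel.
std::vector<KMerPositions> kMerizeAll(const std::vector<std::string>& seqs, int k)
{
    assert(k > 0);
    std::vector<KMerPositions> results(seqs.size()); // sized before the loop
    const long long n = static_cast<long long>(seqs.size());
    const std::size_t len = static_cast<std::size_t>(k);

    // OpenMP distributes iterations over threads; without -fopenmp the
    // pragma is ignored and the loop simply runs serially.
    #pragma omp parallel for schedule(dynamic)
    for (long long seqI = 0; seqI < n; seqI++)
    {
        const std::string& seq = seqs[static_cast<std::size_t>(seqI)];
        // Only this iteration touches results[seqI], so no locking is used.
        KMerPositions& out = results[static_cast<std::size_t>(seqI)];
        for (std::size_t i = 0; i + len <= seq.length(); i++)
        {
            out[seq.substr(i, len)][static_cast<unsigned int>(i)]++;
        }
    }
    return results;
}
```

With g++ this would be compiled with `-fopenmp`; the result for sequence `seqI` ends up in `results[seqI]` rather than in a map keyed by `seqI`.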
