I have a huge array (say, ParticleId[]) of unique integers (representing particle IDs) stored in memory in random order. I need to build a hash table to map each ID to its location inside the array, i.e., from ID to index. The IDs are not necessarily consecutive integers, so a simple look-up array is not a good solution.
I am currently using the C++11 unordered_map container to achieve this. The map is initialized with a loop:
unordered_map<ParticleId_t, ParticleIndex_t> ParticleHash;
ParticleHash.rehash(NumberOfParticles);    // request at least NumberOfParticles buckets
ParticleHash.reserve(NumberOfParticles);   // size the buckets for NumberOfParticles elements, honouring max_load_factor
for (ParticleIndex_t i = 0; i < NumberOfParticles; i++)
    ParticleHash[ParticleId[i]] = i;
ParticleId_t and ParticleIndex_t are simply typedef'd integer types.
NumberOfParticles can be huge (e.g., 1e9). The ParticleId[] array and NumberOfParticles are const as far as the hash table is concerned.
Currently it takes quite a lot of time to build the unordered_map as above. My questions are:
- Is unordered_map the optimal choice for this problem?
  - Would map be faster to initialize, although it may not be as efficient in the look-up?
- Is it possible to speed up the initialization?
  - Is it much faster to use ParticleHash.insert() than ParticleHash[]=, or any other function?
- Given that my keys are known to be unique integers, is there a way to optimize the map as well as the insertion? (A sketch of one idea I have in mind follows this list.)
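To make the last point concrete: since the keys are already unique, well-mixed integers, one idea is to hand unordered_map a trivial hash instead of std::hash (which may already be a pass-through on some implementations, so this may or may not help). Below is a minimal, untested sketch of what I mean; the typedefs and the std::vector wrapper are placeholders rather than my actual code.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

typedef std::int64_t ParticleId_t;     // placeholder typedefs; the real ones are project-specific
typedef std::int64_t ParticleIndex_t;

// Trivial hash: the IDs are already unique integers, so pass them through unchanged.
struct IdentityHash
{
    std::size_t operator()(ParticleId_t id) const
    { return static_cast<std::size_t>(id); }
};

std::unordered_map<ParticleId_t, ParticleIndex_t, IdentityHash>
BuildHash(const std::vector<ParticleId_t> &ParticleId)
{
    std::unordered_map<ParticleId_t, ParticleIndex_t, IdentityHash> hash;
    hash.reserve(ParticleId.size());   // one up-front bucket allocation instead of repeated rehashes
    for (ParticleIndex_t i = 0; i < static_cast<ParticleIndex_t>(ParticleId.size()); ++i)
        hash.emplace(ParticleId[i], i);   // emplace skips the default-construct-then-assign of operator[]
    return hash;
}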
I am considering Intel's concurrent_unordered_map to parallelize the construction. However, that would introduce a dependency on the Intel TBB library, which I would like to avoid if possible. Is there an easy solution using only native STL containers?
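For reference, the TBB route I have in mind would look roughly like the following sketch (untested; it relies on tbb::parallel_for and the thread-safe insert of tbb::concurrent_unordered_map, and it is precisely this extra dependency that I would prefer to avoid).

#include <cstddef>
#include <cstdint>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/concurrent_unordered_map.h>
#include <tbb/parallel_for.h>

typedef std::int64_t ParticleId_t;     // placeholder typedefs
typedef std::int64_t ParticleIndex_t;

tbb::concurrent_unordered_map<ParticleId_t, ParticleIndex_t>
BuildHashParallel(const std::vector<ParticleId_t> &ParticleId)
{
    tbb::concurrent_unordered_map<ParticleId_t, ParticleIndex_t> hash;
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, ParticleId.size()),
                      [&](const tbb::blocked_range<std::size_t> &r)
                      {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              hash.insert({ParticleId[i], static_cast<ParticleIndex_t>(i)});  // thread-safe insert
                      });
    return hash;
}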
Update:
Now I've reverted to a plain sorted index table and rely on bsearch for lookup. At least the initialization of the table is now 20 times faster and can be easily parallelized.
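For completeness, this is roughly what that approach looks like: a sketch using std::sort and std::lower_bound (the C++ counterpart of the bsearch call I actually use); the typedefs are again placeholders.

#include <algorithm>
#include <cstdint>
#include <vector>

typedef std::int64_t ParticleId_t;     // placeholder typedefs
typedef std::int64_t ParticleIndex_t;

struct IdIndexEntry
{
    ParticleId_t    Id;
    ParticleIndex_t Index;
};

// Build: fill (id, index) pairs, then sort by id. The fill loop is embarrassingly
// parallel, and the sort can be swapped for any parallel sort.
std::vector<IdIndexEntry> BuildSortedTable(const std::vector<ParticleId_t> &ParticleId)
{
    std::vector<IdIndexEntry> table(ParticleId.size());
    for (ParticleIndex_t i = 0; i < static_cast<ParticleIndex_t>(ParticleId.size()); ++i)
        table[i] = {ParticleId[i], i};
    std::sort(table.begin(), table.end(),
              [](const IdIndexEntry &a, const IdIndexEntry &b) { return a.Id < b.Id; });
    return table;
}

// Lookup: binary search on the sorted ids; returns -1 if the id is absent.
ParticleIndex_t LookupIndex(const std::vector<IdIndexEntry> &table, ParticleId_t id)
{
    auto it = std::lower_bound(table.begin(), table.end(), id,
                               [](const IdIndexEntry &e, ParticleId_t key) { return e.Id < key; });
    return (it != table.end() && it->Id == id) ? it->Index : -1;
}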