OK, here's how you should approach such things.
When someone claims their implementation is 2-3x faster, backs it only with common phrases such as "maximum code reuse" and "no extra safety", and doesn't provide any real benchmark, a red flag should go up in your head. Indeed, all the benchmarks in their mailing lists/docs have no source code and (judging by the results) were written by hand instead of using JMH, so they probably violate basic benchmarking rules.
Before hand-waving about why something is faster than something else, let's write a benchmark and see whether it's really faster before making any statements.
The benchmark code is here: it simply tests all basic operations for sets of size 1024 and 1024 * 1024 (~1M) with a fill factor of 50%. Tests are run on an Intel Core i7-4870HQ CPU @ 2.50GHz. The score is throughput: the higher, the better.
The whole benchmark looks like this:
  @Benchmark
  public boolean getClassic(BitSetState state) {
      return state.bitSet.get(state.nextIndex);
  }
  @Benchmark
  public boolean getOpen(BitSetState state) {
      return state.openBitSet.get(state.nextIndex);
  }
  @Benchmark
  public boolean getOpenFast(BitSetState state) {
      return state.openBitSet.fastGet(state.nextIndex);
  }
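For context, here is a minimal sketch of what the `BitSetState` holder might look like. This is an assumption, not the actual benchmark source: the real class would be a JMH `@State(Scope.Benchmark)` class filled in a `@Setup` method, and would also hold the Lucene `OpenBitSet`, which is omitted here so the sketch compiles against the standard library alone.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical reconstruction of the benchmark state (names assumed).
// In the real JMH benchmark this would carry @State/@Setup annotations
// and an OpenBitSet field; both are left out to keep this stdlib-only.
public class BitSetState {
    static final int SET_SIZE = 1024;
    static final double FILL_FACTOR = 0.5;

    final BitSet bitSet = new BitSet(SET_SIZE);
    int nextIndex;

    public BitSetState(long seed) {
        Random random = new Random(seed);
        for (int i = 0; i < SET_SIZE; i++) {
            if (random.nextDouble() < FILL_FACTOR) {
                bitSet.set(i); // roughly half the bits end up set
            }
        }
        nextIndex = random.nextInt(SET_SIZE); // index probed by get/set benchmarks
    }
}
```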
Ok, let's see the results:
Benchmark                           (setSize)   Mode  Cnt    Score    Error   Units
BitSetBenchmark.andClassic               1024  thrpt    5  109.541 ± 46.361  ops/us
BitSetBenchmark.andOpen                  1024  thrpt    5  111.039 ±  9.648  ops/us
BitSetBenchmark.cardinalityClassic       1024  thrpt    5   93.509 ± 10.943  ops/us
BitSetBenchmark.cardinalityOpen          1024  thrpt    5   29.216 ±  4.824  ops/us
BitSetBenchmark.getClassic               1024  thrpt    5  291.944 ± 46.907  ops/us
BitSetBenchmark.getOpen                  1024  thrpt    5  245.023 ± 75.144  ops/us
BitSetBenchmark.getOpenFast              1024  thrpt    5  228.563 ± 91.933  ops/us
BitSetBenchmark.orClassic                1024  thrpt    5  121.070 ± 12.220  ops/us
BitSetBenchmark.orOpen                   1024  thrpt    5  107.612 ± 16.579  ops/us
BitSetBenchmark.setClassic               1024  thrpt    5  527.291 ± 26.895  ops/us
BitSetBenchmark.setNextClassic           1024  thrpt    5  592.465 ± 34.926  ops/us
BitSetBenchmark.setNextOpen              1024  thrpt    5  575.186 ± 33.459  ops/us
BitSetBenchmark.setOpen                  1024  thrpt    5  527.568 ± 46.240  ops/us
BitSetBenchmark.setOpenFast              1024  thrpt    5  522.131 ± 54.856  ops/us
Benchmark                           (setSize)   Mode  Cnt    Score    Error   Units
BitSetBenchmark.andClassic            1232896  thrpt    5    0.111 ±  0.009  ops/us
BitSetBenchmark.andOpen               1232896  thrpt    5    0.131 ±  0.010  ops/us
BitSetBenchmark.cardinalityClassic    1232896  thrpt    5    0.174 ±  0.012  ops/us
BitSetBenchmark.cardinalityOpen       1232896  thrpt    5    0.049 ±  0.004  ops/us
BitSetBenchmark.getClassic            1232896  thrpt    5  298.027 ± 40.317  ops/us
BitSetBenchmark.getOpen               1232896  thrpt    5  243.472 ± 87.491  ops/us
BitSetBenchmark.getOpenFast           1232896  thrpt    5  248.743 ± 79.071  ops/us
BitSetBenchmark.orClassic             1232896  thrpt    5    0.135 ±  0.017  ops/us
BitSetBenchmark.orOpen                1232896  thrpt    5    0.131 ±  0.021  ops/us
BitSetBenchmark.setClassic            1232896  thrpt    5  525.137 ± 11.849  ops/us
BitSetBenchmark.setNextClassic        1232896  thrpt    5  597.890 ± 51.158  ops/us
BitSetBenchmark.setNextOpen           1232896  thrpt    5  485.154 ± 63.016  ops/us
BitSetBenchmark.setOpen               1232896  thrpt    5  524.989 ± 27.977  ops/us
BitSetBenchmark.setOpenFast           1232896  thrpt    5  532.943 ± 74.671  ops/us
Surprising, isn't it? What can we learn from the results?
- Get and set (including the fast versions) are equal in performance. Their results lie within the same error bounds, and it's hard to tell any difference without proper nanobenchmarking, so in a typical application the choice of implementation makes no difference and the one extra if branch doesn't matter. The claim about OpenBitSet's better get/set performance is therefore false. UPD: a nanobenchmark of the get methods doesn't show any difference either; results are here.
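To make that "one more if branch" concrete: `java.util.BitSet#get` checks whether the requested word is within the words in use and simply returns false past them, which is exactly the check `OpenBitSet#fastGet` omits (making it the caller's job to stay in range). The checked behaviour is easy to demonstrate with the standard library alone:

```java
import java.util.BitSet;

public class CheckedGetDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet(64);
        bits.set(3);
        // The in-range read works as expected...
        System.out.println(bits.get(3));         // true
        // ...and an out-of-range read is safely answered with false
        // instead of touching memory past the backing array.
        System.out.println(bits.get(1_000_000)); // false
    }
}
```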
- The cardinality of BitSet can be calculated much faster (~3x for both the 1K and 1M sizes), so the statement about "ultra fast cardinality" is false. But numbers are meaningless without an answer to why the performance differs, so let's dig a bit. To count bits in its words, BitSet uses Long#bitCount, which is a HotSpot intrinsic. That means the whole bitCount method is compiled into a single instruction (for the curious ones: x86 popcnt). OpenBitSet, meanwhile, uses hand-rolled bit counting with tricks from Hacker's Delight (see org.apache.lucene.util.BitUtil#pop_array). No wonder the classic version is faster now.
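The two counting styles are easy to put side by side. `Long.bitCount` below is the intrinsic (a single `popcnt` on x86 under HotSpot), while `pop` is a Hacker's Delight-style SWAR population count of the kind `BitUtil#pop_array` is built on, costing a dozen ALU operations per word:

```java
public class PopCountDemo {
    // Hand-rolled 64-bit population count in the Hacker's Delight style:
    // fold pair counts, then nibble counts, then sum bytes into the low byte.
    public static int pop(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        x = (x + (x >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        x = x + (x >>> 8);
        x = x + (x >>> 16);
        x = x + (x >>> 32);
        return (int) x & 0x7f;
    }

    public static void main(String[] args) {
        for (long v : new long[] {0L, 1L, -1L, 0x0123456789abcdefL}) {
            // Same answer as the intrinsic; only the machine code differs.
            System.out.println(pop(v) + " == " + Long.bitCount(v));
        }
    }
}
```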
- Group operations like and/or perform the same, so no performance win here. But here's an interesting thing: the BitSet implementation tracks the maximum index of a word in which at least one bit is set, and performs and/or/cardinality only within the bounds [0, maxIndex]. So we can compare specific cases where a set has only its first 1/10/50% of bits set and the rest empty (with the same 50% fill factor within the filled part). Then BitSet's performance should differ, while OpenBitSet's should stay the same. Let's validate (benchmark code):
Benchmark                   (fillFactor)  (setSize)   Mode  Cnt   Score    Error   Units
BitSetBenchmark.andClassic          0.01    1232896  thrpt    5  32.036 ±  1.320  ops/us
BitSetBenchmark.andClassic           0.1    1232896  thrpt    5   3.824 ±  0.896  ops/us
BitSetBenchmark.andClassic           0.5    1232896  thrpt    5   0.330 ±  0.027  ops/us
BitSetBenchmark.andClassic             1    1232896  thrpt    5   0.140 ±  0.017  ops/us
BitSetBenchmark.andOpen             0.01    1232896  thrpt    5   0.142 ±  0.008  ops/us
BitSetBenchmark.andOpen              0.1    1232896  thrpt    5   0.128 ±  0.015  ops/us
BitSetBenchmark.andOpen              0.5    1232896  thrpt    5   0.112 ±  0.015  ops/us
BitSetBenchmark.andOpen                1    1232896  thrpt    5   0.132 ±  0.018  ops/us
BitSetBenchmark.orClassic           0.01    1232896  thrpt    5  27.826 ± 13.312  ops/us
BitSetBenchmark.orClassic            0.1    1232896  thrpt    5   3.727 ±  1.161  ops/us
BitSetBenchmark.orClassic            0.5    1232896  thrpt    5   0.342 ±  0.022  ops/us
BitSetBenchmark.orClassic              1    1232896  thrpt    5   0.133 ±  0.021  ops/us
BitSetBenchmark.orOpen              0.01    1232896  thrpt    5   0.133 ±  0.009  ops/us
BitSetBenchmark.orOpen               0.1    1232896  thrpt    5   0.118 ±  0.007  ops/us
BitSetBenchmark.orOpen               0.5    1232896  thrpt    5   0.127 ±  0.018  ops/us
BitSetBenchmark.orOpen                 1    1232896  thrpt    5   0.148 ±  0.023  ops/us
 
The smaller the filled part of the set, the faster BitSet is, and when bits are distributed uniformly the performance of BitSet and OpenBitSet becomes equal; the theory is confirmed. So for specific non-uniform distributions of set bits the classic BitSet is faster for group operations, and the statement about very fast group operations in OpenBitSet is false.
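The bound BitSet tracks is observable through its public API: `size()` reports the allocated bits, while `length()` reports the highest set bit plus one, which delimits the region the group operations actually iterate over. A quick sketch of the sparse case from the table above:

```java
import java.util.BitSet;

public class SparseBitSetDemo {
    public static void main(String[] args) {
        int logicalSize = 1 << 20;           // ~1M bits allocated
        BitSet sparse = new BitSet(logicalSize);
        for (int i = 0; i < logicalSize / 100; i += 2) {
            sparse.set(i);                   // fill only the first ~1% region
        }
        // size() = allocated bits; length() = highest set bit + 1.
        // and/or/cardinality only walk words up to length(), which here
        // is two orders of magnitude smaller than the allocation.
        System.out.println(sparse.size());   // 1048576
        System.out.println(sparse.length()); // ~10485
    }
}
```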
Summary
This answer and these benchmarks don't intend to show that OpenBitSet is bad or that its authors are liars. Indeed, given their benchmark machines (AMD Opteron and Pentium 4) and Java version (1.5), it's easy to believe that BitSet was less optimized back then, the HotSpot compiler wasn't very smart, the popcnt instruction didn't exist, and so OpenBitSet was a good idea and much more performant at the time. Moreover, BitSet doesn't expose its internal word array, so it's impossible to build a custom fine-grained synchronized bitset or flexible serialization on top of it, and that's what Lucene needed. So for Lucene it's still a reasonable choice, while for typical users it's better to use the standard BitSet, which is faster (in some cases, not in general) and belongs to the standard library. Times change, and old performance results change with them, so always benchmark and validate your specific cases; maybe for some of them (e.g. the unbenchmarked iterator, or a different fill factor) OpenBitSet will be faster.