OK, here's how you should approach such things.
When someone claims their implementation is 2-3x faster, backs it only with common phrases such as "maximum code reuse" and "no extra safety", and doesn't provide any real benchmark, a red flag should go up in your head. Indeed, all the benchmarks in their mailing lists/docs have no source code and (judging by the results) were written by hand instead of using JMH, so they probably violate basic benchmarking rules.
Before hand-waving about why something is faster than something else, let's write a benchmark and see whether it's really faster before making any statements.
The benchmark code is here: it simply tests all basic operations for sets of size 1024 and 1024 * 1024 (~1M) with a fill factor of 50%. Tests are run on an Intel Core i7-4870HQ CPU @ 2.50GHz. The score is throughput: the higher, the better.
The whole benchmark looks like this:
  @Benchmark
  public boolean getClassic(BitSetState state) {
      return state.bitSet.get(state.nextIndex);
  }
  @Benchmark
  public boolean getOpen(BitSetState state) {
      return state.openBitSet.get(state.nextIndex);
  }
  @Benchmark
  public boolean getOpenFast(BitSetState state) {
      return state.openBitSet.fastGet(state.nextIndex);
  }
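For context, here is a minimal sketch of what the `BitSetState` holder might look like. This is an assumption, not the actual benchmark source: the real class would be a JMH `@State(Scope.Benchmark)` class filled in a `@Setup` method, and would also hold the Lucene `OpenBitSet`, which is omitted here so the sketch compiles against the standard library alone.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical reconstruction of the benchmark state (names assumed).
// In the real JMH benchmark this would carry @State/@Setup annotations
// and an OpenBitSet field; both are left out to keep this stdlib-only.
public class BitSetState {
    static final int SET_SIZE = 1024;
    static final double FILL_FACTOR = 0.5;

    final BitSet bitSet = new BitSet(SET_SIZE);
    int nextIndex;

    public BitSetState(long seed) {
        Random random = new Random(seed);
        for (int i = 0; i < SET_SIZE; i++) {
            if (random.nextDouble() < FILL_FACTOR) {
                bitSet.set(i); // roughly half the bits end up set
            }
        }
        nextIndex = random.nextInt(SET_SIZE); // index probed by get/set benchmarks
    }
}
```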
Ok, let's see the results:
Benchmark                           (setSize)   Mode  Cnt    Score    Error   Units
BitSetBenchmark.andClassic               1024  thrpt    5  109.541 ± 46.361  ops/us
BitSetBenchmark.andOpen                  1024  thrpt    5  111.039 ±  9.648  ops/us
BitSetBenchmark.cardinalityClassic       1024  thrpt    5   93.509 ± 10.943  ops/us
BitSetBenchmark.cardinalityOpen          1024  thrpt    5   29.216 ±  4.824  ops/us
BitSetBenchmark.getClassic               1024  thrpt    5  291.944 ± 46.907  ops/us
BitSetBenchmark.getOpen                  1024  thrpt    5  245.023 ± 75.144  ops/us
BitSetBenchmark.getOpenFast              1024  thrpt    5  228.563 ± 91.933  ops/us
BitSetBenchmark.orClassic                1024  thrpt    5  121.070 ± 12.220  ops/us
BitSetBenchmark.orOpen                   1024  thrpt    5  107.612 ± 16.579  ops/us
BitSetBenchmark.setClassic               1024  thrpt    5  527.291 ± 26.895  ops/us
BitSetBenchmark.setNextClassic           1024  thrpt    5  592.465 ± 34.926  ops/us
BitSetBenchmark.setNextOpen              1024  thrpt    5  575.186 ± 33.459  ops/us
BitSetBenchmark.setOpen                  1024  thrpt    5  527.568 ± 46.240  ops/us
BitSetBenchmark.setOpenFast              1024  thrpt    5  522.131 ± 54.856  ops/us
Benchmark                           (setSize)   Mode  Cnt    Score    Error   Units
BitSetBenchmark.andClassic            1232896  thrpt    5    0.111 ±  0.009  ops/us
BitSetBenchmark.andOpen               1232896  thrpt    5    0.131 ±  0.010  ops/us
BitSetBenchmark.cardinalityClassic    1232896  thrpt    5    0.174 ±  0.012  ops/us
BitSetBenchmark.cardinalityOpen       1232896  thrpt    5    0.049 ±  0.004  ops/us
BitSetBenchmark.getClassic            1232896  thrpt    5  298.027 ± 40.317  ops/us
BitSetBenchmark.getOpen               1232896  thrpt    5  243.472 ± 87.491  ops/us
BitSetBenchmark.getOpenFast           1232896  thrpt    5  248.743 ± 79.071  ops/us
BitSetBenchmark.orClassic             1232896  thrpt    5    0.135 ±  0.017  ops/us
BitSetBenchmark.orOpen                1232896  thrpt    5    0.131 ±  0.021  ops/us
BitSetBenchmark.setClassic            1232896  thrpt    5  525.137 ± 11.849  ops/us
BitSetBenchmark.setNextClassic        1232896  thrpt    5  597.890 ± 51.158  ops/us
BitSetBenchmark.setNextOpen           1232896  thrpt    5  485.154 ± 63.016  ops/us
BitSetBenchmark.setOpen               1232896  thrpt    5  524.989 ± 27.977  ops/us
BitSetBenchmark.setOpenFast           1232896  thrpt    5  532.943 ± 74.671  ops/us
Surprising, isn't it? What can we learn from the results?
- Get and set (including the fast versions) are equal in performance. Their results lie within the same error bounds, and it's hard to tell any difference without proper nanobenchmarking, so in a typical application the choice of implementation makes no difference and the one extra if branch doesn't matter. The claim about OpenBitSet's better get/set performance is therefore false. UPD: a nanobenchmark of the get methods doesn't show any difference either; results are here.
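To make that "one more if branch" concrete: `java.util.BitSet#get` checks whether the requested word is within the words in use and simply returns false past them, which is exactly the check `OpenBitSet#fastGet` omits (making it the caller's job to stay in range). The checked behaviour is easy to demonstrate with the standard library alone:

```java
import java.util.BitSet;

public class CheckedGetDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet(64);
        bits.set(3);
        // The in-range read works as expected...
        System.out.println(bits.get(3));         // true
        // ...and an out-of-range read is safely answered with false
        // instead of touching memory past the backing array.
        System.out.println(bits.get(1_000_000)); // false
    }
}
```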
- The cardinality of BitSet can be calculated much faster (~3x for both the 1K and 1M sizes), so the statement about "ultra fast cardinality" is false. But numbers are meaningless without an answer to why the performance differs, so let's dig a bit. To count bits in its words, BitSet uses Long#bitCount, which is a HotSpot intrinsic. That means the whole bitCount method is compiled into a single instruction (for the curious ones: x86 popcnt). OpenBitSet, meanwhile, uses hand-rolled bit counting with tricks from Hacker's Delight (see org.apache.lucene.util.BitUtil#pop_array). No wonder the classic version is faster now.
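The two counting styles are easy to put side by side. `Long.bitCount` below is the intrinsic (a single `popcnt` on x86 under HotSpot), while `pop` is a Hacker's Delight-style SWAR population count of the kind `BitUtil#pop_array` is built on, costing a dozen ALU operations per word:

```java
public class PopCountDemo {
    // Hand-rolled 64-bit population count in the Hacker's Delight style:
    // fold pair counts, then nibble counts, then sum bytes into the low byte.
    public static int pop(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        x = (x + (x >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        x = x + (x >>> 8);
        x = x + (x >>> 16);
        x = x + (x >>> 32);
        return (int) x & 0x7f;
    }

    public static void main(String[] args) {
        for (long v : new long[] {0L, 1L, -1L, 0x0123456789abcdefL}) {
            // Same answer as the intrinsic; only the machine code differs.
            System.out.println(pop(v) + " == " + Long.bitCount(v));
        }
    }
}
```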
- Group operations like and/or perform the same, so no performance win here. But here's an interesting thing: the BitSet implementation tracks the maximum index of a word in which at least one bit is set, and performs and/or/cardinality only within the bounds [0, maxIndex]. So we can compare specific cases where a set has only its first 1/10/50% of bits set and the rest empty (with the same 50% fill factor within the filled part). Then BitSet's performance should differ, while OpenBitSet's should stay the same. Let's validate (benchmark code):
Benchmark                   (fillFactor)  (setSize)   Mode  Cnt   Score    Error   Units
BitSetBenchmark.andClassic          0.01    1232896  thrpt    5  32.036 ±  1.320  ops/us
BitSetBenchmark.andClassic           0.1    1232896  thrpt    5   3.824 ±  0.896  ops/us
BitSetBenchmark.andClassic           0.5    1232896  thrpt    5   0.330 ±  0.027  ops/us
BitSetBenchmark.andClassic             1    1232896  thrpt    5   0.140 ±  0.017  ops/us
BitSetBenchmark.andOpen             0.01    1232896  thrpt    5   0.142 ±  0.008  ops/us
BitSetBenchmark.andOpen              0.1    1232896  thrpt    5   0.128 ±  0.015  ops/us
BitSetBenchmark.andOpen              0.5    1232896  thrpt    5   0.112 ±  0.015  ops/us
BitSetBenchmark.andOpen                1    1232896  thrpt    5   0.132 ±  0.018  ops/us
BitSetBenchmark.orClassic           0.01    1232896  thrpt    5  27.826 ± 13.312  ops/us
BitSetBenchmark.orClassic            0.1    1232896  thrpt    5   3.727 ±  1.161  ops/us
BitSetBenchmark.orClassic            0.5    1232896  thrpt    5   0.342 ±  0.022  ops/us
BitSetBenchmark.orClassic              1    1232896  thrpt    5   0.133 ±  0.021  ops/us
BitSetBenchmark.orOpen              0.01    1232896  thrpt    5   0.133 ±  0.009  ops/us
BitSetBenchmark.orOpen               0.1    1232896  thrpt    5   0.118 ±  0.007  ops/us
BitSetBenchmark.orOpen               0.5    1232896  thrpt    5   0.127 ±  0.018  ops/us
BitSetBenchmark.orOpen                 1    1232896  thrpt    5   0.148 ±  0.023  ops/us
 
The smaller the filled part of the set, the faster BitSet is, and when bits are distributed uniformly the performance of BitSet and OpenBitSet becomes equal; the theory is confirmed. So for specific non-uniform distributions of set bits the classic BitSet is faster for group operations, and the statement about very fast group operations in OpenBitSet is false.
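The bound BitSet tracks is observable through its public API: `size()` reports the allocated bits, while `length()` reports the highest set bit plus one, which delimits the region the group operations actually iterate over. A quick sketch of the sparse case from the table above:

```java
import java.util.BitSet;

public class SparseBitSetDemo {
    public static void main(String[] args) {
        int logicalSize = 1 << 20;           // ~1M bits allocated
        BitSet sparse = new BitSet(logicalSize);
        for (int i = 0; i < logicalSize / 100; i += 2) {
            sparse.set(i);                   // fill only the first ~1% region
        }
        // size() = allocated bits; length() = highest set bit + 1.
        // and/or/cardinality only walk words up to length(), which here
        // is two orders of magnitude smaller than the allocation.
        System.out.println(sparse.size());   // 1048576
        System.out.println(sparse.length()); // ~10485
    }
}
```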
Summary
This answer and these benchmarks don't intend to show that OpenBitSet is bad or that its authors are liars. Indeed, given their benchmark machines (AMD Opteron and Pentium 4) and Java version (1.5), it's easy to believe that BitSet was less optimized back then, the HotSpot compiler wasn't very smart, the popcnt instruction didn't exist, and so OpenBitSet was a good idea and much more performant at the time. Moreover, BitSet doesn't expose its internal word array, so it's impossible to build a custom fine-grained synchronized bitset or flexible serialization on top of it, and that's what Lucene needed. So for Lucene it's still a reasonable choice, while for typical users it's better to use the standard BitSet, which is faster (in some cases, not in general) and belongs to the standard library. Times change, and old performance results change with them, so always benchmark and validate your specific cases; maybe for some of them (e.g. the unbenchmarked iterator, or a different fill factor) OpenBitSet will be faster.