(Work in progress, power just went out here; posting now to save my work.)
Crusty old x86 CPUs have very slow C++20 std::countl_zero / GNU C __builtin_clz (386 bsr = Bit Scan Reverse actually finds the position of the highest set bit, like 31-clz, and is weird for an input of 0 so you need to branch on that.)  For CPUs before Pentium Pro / Pentium II, Harold's answer is your best bet, generating a mask directly instead of a count.
(Before 386, shifting by large counts might be better done with partial register shenanigans like mov al, ah / mov ah, 0 instead of shr ax, 8, since 286 and earlier didn't have a barrel shifter for constant-time shifts.  But in C++, that's something for the compiler to figure out.  Shift by 16 is free since a 32-bit integer can only be kept in a pair of 16-bit registers on 286 or earlier.)
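Before the per-CPU timings, here's a minimal sketch (mine, not from the question) of the zero-input check you need when `__builtin_clz` lowers to `bsr`; the helper name `clz32` is just for illustration:

```c++
#include <cstdint>

// Sketch: a countl_zero-style helper that's safe for 0 on pre-lzcnt CPUs,
// where __builtin_clz lowers to bsr and has an undefined result for input 0.
unsigned clz32(uint32_t x) {
    if (x == 0) return 32;        // bsr can't produce this; needs a branch or cmov
    return __builtin_clz(x);      // 31 - bit-index from bsr
}
```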
- 8086 to 286: no instruction available.
- 386: `bsf`/`bsr`: 10+3n cycles.  Worst-case: 10+3*31 = 103c
 
- 486: `bsf` (16 or 32-bit registers): 6-42 cycles; `bsr` 7-104 cycles (1 cycle less for 16-bit regs).
 
- P5 Pentium: `bsf`: 6-42 cycles (6-34 for 16-bit); `bsr` 7-71 cycles (or 7-39 for 16-bit).  Non-pairable.
 
- Intel P6 and later: `bsf`/`bsr`: 1 uop with 1 cycle throughput, 3 cycle latency.  (PPro / PII and later.)
 
- AMD K7/K8/K10/Bulldozer/Zen: `bsf`/`bsr` are slowish for a modern CPU.  e.g. K10: 3 cycle throughput, 4 cycle latency, 6 / 7 m-ops respectively.
 
- Intel Haswell / AMD K10: `lzcnt` introduced (as part of BMI1 for Intel, or with its own feature bit for AMD, before `tzcnt` and the rest of BMI1).
 For an input of 0, they return the operand-size, so they fully implement C++20 `std::countl_zero` / `countr_zero` respectively, unlike `bsr`/`bsf`.  (Which leave the destination unmodified on input=0.  AMD documents this; Intel implements it in practice on current CPUs at least, but documents the destination register's contents as "undefined".  Perhaps some older Intel CPUs are different; otherwise it's just annoying that they don't document the behaviour so software can take advantage.)
 - On AMD, they're fast: a single uop for `lzcnt`, with `tzcnt` taking one more (probably a bit-reverse to feed the `lzcnt` execution unit), so a nice win vs. `bsf`/`bsr`.  This is why compilers typically use `rep bsf` for `countr_zero` / `__builtin_ctz`, so it will run as `tzcnt` on CPUs that support it, but as `bsf` on older CPUs.  (They produce the same results for non-zero inputs, unlike `bsr`/`lzcnt`.)  See the sketch just after this list.
 - On Intel, same fast performance as `bsf`/`bsr`, even including the output dependency, until Skylake fixed that; it's a true dependency for `bsf`/`bsr`, but a false dependency for `tzcnt`/`lzcnt` and `popcnt`.
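To illustrate the `rep bsf` point above (my sketch; `ctz32` is a made-up name): with an input that's known non-zero, the compiler doesn't have to emit any zero check, so the same machine code can run as `bsf` on old CPUs and as `tzcnt` on new ones:

```c++
// __builtin_ctz has an undefined result for 0, so no zero-check is needed;
// as described above, this can be emitted as rep bsf (the tzcnt encoding),
// which old CPUs execute as plain bsf.  Same result either way for x != 0.
unsigned ctz32(unsigned x) {
    return __builtin_ctz(x);
}
```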
 
Fast algorithm with a bit-scan building block
But on P6 (Pentium Pro) and later, a bit-scan for the highest set bit is likely to be a useful building block for an even faster strategy than log2(width) shift/or operations, especially for uint64_t on a 64-bit machine.  (Or maybe even more so for uint64_t on a 32-bit machine, where each shift would require shifting bits across the gap.)
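For contrast, here's roughly what the log2(width) chain of shift/or operations looks like for this problem (a portable sketch of the usual bit-smearing approach, not code from the question or the other answers):

```c++
#include <cstdint>

// Portable baseline: 5 shift/or steps for 32 bits (6 for uint64_t), no bit-scan needed.
uint32_t set_leading_zeros_smear(uint32_t x) {
    uint32_t m = x;
    m |= m >> 1;  m |= m >> 2;  m |= m >> 4;
    m |= m >> 8;  m |= m >> 16;       // m = all bits at or below the highest set bit
    return x | ~m;                    // set everything above it (all-ones for x == 0)
}
```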
Cycle counts from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html which has instruction timings for 8088 through Pentium.  (But not counting the instruction-fetch bottleneck which usually dominates 8086 and especially 8088 performance.)
bsr (index of highest set bit) is fast on modern x86: 1 cycle throughput on P6 and later, not bad on AMD.  On even more recent x86, BMI1 lzcnt is 1 cycle on AMD as well, and avoids an output dependency (on Skylake and newer).  Also it works for an input of 0 (producing the type width aka operand size), unlike bsr which leaves the destination register unmodified.
I think the best version of this (if BMI2 is available) is one inspired by Ted Lyngmo's answer, but changed to shift left / right instead of generating a mask.  ISO C++ before C++20 doesn't guarantee that >> is an arithmetic right shift on signed integer types (it's implementation-defined), but all sane compilers choose that, and C++20 finally requires it.  (For example, GNU C documents it.)
https://godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0.
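That idea looks something like this (my reconstruction of it, assuming x != 0 so `__builtin_clz` is well-defined and lz is at most 31):

```c++
// Shift the leading zeros out the top, then let the arithmetic right shift
// copy the (now top) set bit back down over them.  Relies on >> being an
// arithmetic shift for signed types, as discussed above.
int set_leading_zeros_nonzero(int x) {
    int lz = __builtin_clz(x);    // undefined for x == 0
    return (x << lz) >> lz;
}
```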
Also an idea with BMI2 `bzhi`, if we're considering what's efficient with BMI2 available: something like `x | ~_bzhi_u32(-1, 32-lz)`.  Unfortunately that requires two inversions, the `32-lzcnt` and the `~`.  We have BMI1 `andn`, but not an equivalent `orn`.  And we can't just use `neg` because `bzhi` doesn't mask the count; that's the whole point, it has unique behaviour for 33 different inputs.  Will probably post these as an answer tomorrow.
#include <limits.h>                             // CHAR_BIT
int set_leading_zeros(int x){
    int lz = __builtin_clz(x|1);                // x|1 keeps the input non-zero, clamping the lzcount to 31 at most
    int tmp = (x<<lz);                          // shift out leading zeros, leaving a 1 (or 0 if x==0)
    tmp |= 1ULL<<(CHAR_BIT * sizeof(tmp) - 1);  // set the MSB in case x==0
    return tmp>>lz;                             // sign-extend with an arithmetic right shift.
}
#include <stdint.h>
#include <immintrin.h>
uint32_t set_leading_zeros_bmi2(uint32_t x){
    int32_t lz = _lzcnt_u32(x);            // returns 0 to 32
    uint32_t mask = _bzhi_u32(-1, 32-lz);  // count = 32-lz handles all 33 possible values, producing mask=0 for lz=32
    return x | ~mask;
}
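A quick sanity check of both versions (test values are mine; assumes the two functions above are in scope, and a build with something like -march=haswell for BMI2/lzcnt):

```c++
#include <cassert>

int main() {
    assert((unsigned)set_leading_zeros(0x00FF0000) == 0xFFFF0000u);
    assert(set_leading_zeros(0)  == -1);                         // x == 0: every bit is a "leading zero"
    assert(set_leading_zeros(-1) == -1);                         // MSB already set: unchanged
    assert(set_leading_zeros_bmi2(0x00FF0000u) == 0xFFFF0000u);  // needs a BMI2-capable CPU
    assert(set_leading_zeros_bmi2(0) == 0xFFFFFFFFu);
}
```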
On x86-64 you can
Combined with BMI2 shlx / sarx for single-uop variable-count shifts even on Intel CPUs.
With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend.  Except if lz is the type width; if you need to handle that, generating a mask is probably more efficient.
Unfortunately `shl`/`sar reg, cl` costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 `shlx` / `sarx` for the shift version to be better than a mask sequence like `bsr ecx, src` / `mov tmp, -1` / `shl tmp, cl` / `or dst, tmp`.  (ORing in `-1 << bit-index` is fine because that bit is already set in `x`, so no `not`/`neg` of the count is needed.)
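For completeness, that mask sequence back in C (my sketch, assuming x != 0 so `bsr` / `__builtin_clz` is well-defined; `set_leading_zeros_mask` is a made-up name):

```c++
#include <cstdint>

// x | (-1 << index_of_highest_set_bit): including that bit in the mask is
// harmless because it's already set in x.
uint32_t set_leading_zeros_mask(uint32_t x) {
    unsigned idx = 31 - __builtin_clz(x);     // what bsr computes directly
    return x | (0xFFFFFFFFu << idx);          // mov -1 / shl reg,cl / or
}
```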