
I am looking for the fastest / most space-efficient way of reducing a 64-bit register to a 32-bit register, retaining only the zero / non-zero status of the 64-bit register.

My current best idea that works for all values is popcntq
(1c throughput, 3c latency on mainstream Intel, 5-byte code size):

// rax is either zero or non-zero
popcntq %rax, %rax
// eax will be zero if rax was zero, otherwise it will be non-zero

NOTE: It will not work to just use the 32-bit eax directly: if rax was, say, 2^61, the zero / non-zero status of eax is not the same as that of rax.
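
For example, a quick sketch of that failure mode:

// RAX = 2^61: non-zero, but its low 32 bits are all zero
movabs $0x2000000000000000, %rax
// tests only the low 32 bits, so ZF gets set even though rax is non-zero
test %eax, %eax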

Is there some better clever method?

Noah
  • Cute question. I was thinking about the analogue for ARM64, where we can do `orr x0, x0, x0, lsr #32`. – Nate Eldredge Sep 21 '21 at 18:03
  • My first thought is `cmp rax, 1; sbb eax, eax`. I don't know how it compares to your current answer, though. It's 6 bytes. I suspect it's cheaper than either popcnt or shr. Edit: it gets the result backwards. – prl Sep 21 '21 at 18:14
  • `crc32 rax, rax` might be an option if better obscurity helps :-) – Alex Guteniev Sep 21 '21 at 18:23
  • @prl: That was a good start - replacing `cmp rax, 1` by `neg rax` gets the right result and is a little shorter too. See my answer. – Nate Eldredge Sep 21 '21 at 18:57
  • Maybe a fun test for a [superoptimizer](https://en.wikipedia.org/wiki/Superoptimization). – Nate Eldredge Sep 21 '21 at 19:03
  • @NateEldredge I took a look at [GNU's](https://www.gnu.org/software/superopt/) but don't see a way to construct the concept of "result = any zero / non-zero value". – Noah Sep 21 '21 at 20:28
  • @MarkKahn - I'm not sure I understand your bounty reason of "looking for an answer from a reputable source." My answer already cites sources to back up its performance analysis: Agner Fog's instruction tables. Did you also want a link to https://uops.info/, or [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) for how to use the numbers? It's always possible there's a trick nobody's thought of yet, but I think many people would consider me a "reputable source" on this. – Peter Cordes Oct 18 '21 at 01:08

2 Answers


Fewest uops (front-end bandwidth):
1 uop, latency 3c (Intel) or 1c (Zen).
Also smallest code-size, 5 bytes.

 popcnt  %rax, %rax         # 5 bytes, 1 uop for one port
 # if using a different dst, note the output dependency on Intel before ICL

On most CPUs that have it at all, it's 3c latency, 1c throughput (1 uop for one port); on Zen 1/2/3 it's 1c latency with 0.25c throughput. (https://agner.org/optimize/ & https://uops.info/)

On Bulldozer-family before Excavator, popcnt r64 is 4c latency, 4c throughput. (32-bit operand-size has 2c throughput but still 4c latency.) Bobcat has quite slow microcoded popcnt.
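
If you do want the count in a different destination on Intel CPUs before ICL, a sketch of the usual workaround (my addition; this is the xor-zeroing trick compilers use to break the false output dependency):

 xor    %edx, %edx          # zeroing idiom: handled at rename, always ready
 popcnt %rax, %rdx          # the output dep on RDX now costs nothing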


Lowest latency (assuming Haswell or newer, so there's no partial-register effect when writing AL and then reading EAX, or any uarch without P6 ancestry, which doesn't rename partial registers at all):
2 cycle latency, 2 uops, 6 bytes. Also the smallest code-size if popcnt (5B) isn't available.

  test  %rax, %rax     # 3B, 1 uop any ALU port
  setnz %al            # 3B, 1 uop p06 (Haswell .. Ice Lake)
# only works in-place, with the rest of EAX already being known 0 for RAX==0

AL is the low byte of EAX, so AL=1 definitely makes EAX non-zero for any non-zero RAX.

This will cost a merging uop when reading EAX on Sandybridge/Ivy Bridge. Core2 / Nehalem will stall for a couple cycles to insert that merging uop. Earlier P6-family like Pentium-M will fully stall until the setcc retires if a later instruction reads EAX. (Why doesn't GCC use partial registers?)
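
If you need the result in a different register (leaving RAX intact), a sketch of a non-in-place variant (my addition, not from the original answer): xor-zero the destination ahead of the flag-setting instruction so the setcc only has to write the low byte.

  xor   %edx, %edx     # 2 bytes; must come before the test, since xor writes flags
  test  %rax, %rax     # 3 bytes
  setnz %dl            # 3 bytes; EDX = 0 or 1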

Nate's neg/sbb is about the same as this on Broadwell and later, but 1 byte shorter. (And it does zero the upper 32 bits.) This is better on Haswell, where sbb is 2 uops. On earlier mainstream Intel CPUs they both cost 3 uops: this one needs a merging uop when EAX is read, while sbb (other than sbb $0 on SnB/HSW) is always 2 uops. neg/sbb can be used into a different register (still destroying the input; see the sketch below), but has a false dependency on CPUs other than AMD. (K8/K10, Bulldozer-family, and Zen-family all recognize sbb same,same as depending only on CF.)
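
A sketch of that different-destination form:

 neg  %rax            # destroys RAX; CF = 1 iff the old RAX was non-zero
 sbb  %edx, %edx      # EDX = -CF: -1 for a non-zero input, 0 for zero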


If you want the upper 32 zeroed, BMI2 RORX to copy-and-shift:
2 uops, 2c latency, 8 bytes

 rorx  $32, %rax, %rdx      # 6 bytes, 1 uop, 1c latency
 or    %edx, %eax           # 2 bytes, 1c latency
## can produce its result in a different register without a false dependency.

rorx $32 is useful in general for horizontal SWAR reductions. For example, for a dword horizontal sum, you could movq a pair of dwords out of an XMM register and do the last shuffle+add step in scalar using rorx/add instead of pshufd/paddd, as sketched below.
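
A sketch of that tail, assuming the vector has already been reduced to one pair of dwords in %xmm0:

 movq  %xmm0, %rax          # two dwords packed into one GPR
 rorx  $32, %rax, %rdx      # copy with the halves swapped
 add   %edx, %eax           # EAX = dword horizontal sum of the pair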

Or without BMI2, while still zeroing the upper 32:
7 bytes, 4 uops, 3c latency on Intel (where bswap r64 is 2 uops, 2c latency), or 3 uops, 2c latency on CPUs with efficient bswap r64 like Zen-family and Silvermont-family.

 mov    %eax, %edx       # 2 bytes, and not on the critical path
 bswap  %rax             # 3 bytes, vs. 4 for  shr $32, %rax 
 or     %edx, %eax       # 2 bytes
## can write a different destination

A compromise, using shr $32, %rax in place of bswap, is
8 bytes, 3 uops, 2c latency even without mov-elimination.
Running an ALU instruction on the original register instead of on the mov result lets a non-eliminated MOV run in parallel with it.
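
Spelled out (a straightforward assembly of the pieces described above):

 mov  %eax, %edx        # 2 bytes; copies the low dword, off the critical path
 shr  $32, %rax         # 4 bytes; high dword now in EAX
 or   %edx, %eax        # 2 bytes; non-zero iff the original RAX was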


Background for evaluating "best" performance: Agner Fog's instruction tables and microarch guide (https://agner.org/optimize/), https://uops.info/, and [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391).


Ideas that didn't pan out:

   bsf %rax, %rax    # 4 bytes.  Fails for RAX=1

bsf leaves the destination unmodified for input=0 (AMD documents this; Intel implements it that way but doesn't document it as future-proof). Unfortunately it produces RAX=0 for an input of 1, since the lowest set bit of 1 is bit 0, wrongly reporting a non-zero input as zero.
It has the same perf as popcnt on Intel, and worse on AMD, but does save 1 byte of code size.

(Using sub $1, %rax or cmp $1, %rax to set CF and then sbb gets the result inverted, since CF is set only for an input of 0; Nate's use of neg, which sets CF for any non-zero input, is how to make that trick work cleanly.)
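
For reference, prl's original version and why its result comes out backwards:

 cmp  $1, %rax        # CF = (RAX < 1 unsigned), i.e. set only when RAX == 0
 sbb  %eax, %eax      # EAX = -CF: -1 for a zero input, 0 for non-zero. Inverted.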

I didn't try using a superoptimizer to brute-force check for other possible sequences, but as Nate commented this is a short enough problem to be a use-case for one.

Peter Cordes
  • I didn't understand the `setnz` solution at first. Then it dawned on me that we only want `eax` = zero if `rax` was zero already, so it suffices even though the `setnz` itself only sets `al`. – ecm Sep 21 '21 at 20:40

One option is

neg rax         ; 48 F7 D8
sbb eax, eax    ; 19 C0

Remember that neg sets flags like a subtract from zero, so it sets the carry flag iff rax is nonzero. And sbb of a register from itself yields 0 or -1 according to whether the carry was clear or set (thanks @prl for suggesting this in a comment).

It's still 5 bytes, and 2 uops instead of 1. But if my math is right, on Skylake you get 2 cycles latency instead of 3, and throughput of 2 per cycle instead of 1.
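
To see it work on the 2^61 case from the question (my walkthrough, not part of the original answer):

; RAX = 2000000000000000h (2^61, non-zero)
neg rax         ; CF = 1 since the input was non-zero
sbb eax, eax    ; EAX = EAX - EAX - CF = -1: non-zero, as required
; for RAX = 0, neg leaves RAX = 0 with CF = 0, so sbb yields EAX = 0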

Nate Eldredge
  • Yup, nice. On Haswell and earlier, `sbb eax,eax` is 2 uops, 2c latency. But good on Broadwell and later. The "false" dependency on EAX (on all CPUs except Bulldozer) is fine because SBB also depends on CF produced by the same instruction that last wrote RAX. This saves a byte vs. the `test rax,rax` / `setnz al` idea (6 bytes, 2 uops, 2c latency) in my answer. – Peter Cordes Sep 21 '21 at 19:20
  • Correction, K8/K10, Bulldozer-family, and Zen-family all recognize `sbb same,same` as depending only on CF according to https://www.agner.org/optimize/microarchitecture.pdf, if you wanted to use this into a different destination register while still destroying the input. – Peter Cordes Sep 22 '21 at 07:54
  • This (or Peter's `test`+`setnz`) is very likely what an optimizing compiler would generate, so has the advantage of being highly idiomatic, even if not the fastest possible code. (Although, as Michael Abrash famously said, "there ain't no such thing as the fastest code"; someone will always come along and beat what you thought was the fastest/best.) MSVC in particular likes to use clever little `sbb` sequences. GCC would probably stick with the arguably more obvious `setCC`, which used to be less efficient back in the days of P6. – Cody Gray - on strike Sep 23 '21 at 05:31
  • @CodyGray: You're thinking of P4 where `setCC` was 3 uops with 5 cycle latency, and maybe worse on Prescott if Agner's 9c latency for either a reg or mem destination is right. P6 has had fast `setCC` since day one with PPro / PII / PIII. https://www.agner.org/optimize/instruction_tables.pdf. (On P5 it's single-cycle non-pairable, with a possible decode penalty for the 0F escape byte). – Peter Cordes Sep 23 '21 at 13:15
  • @CodyGray: Indeed, if you specify `-1` as the value to be returned for a nonzero result, this is what gcc and clang do: https://godbolt.org/z/oPMdYn7v1 – Nate Eldredge Sep 23 '21 at 14:12