
I'm looking at clang's output to see what this C expression produces:

(mask == 0xffff ? one : zero)

where one is set like this:

const __m128i one = _mm_set_epi64x(0, 1);

And the assembly output:

 4e0:   66 0f d7 c0     pmovmskb    eax, xmm0
 4e4:   3d ff ff 00 00  cmp eax, 65535
 4e9:   66 0f ef c0     pxor    xmm0, xmm0
 4ed:   74 06   je  6 <_vm_run+0x465>
 4ef:   66 0f ef c9     pxor    xmm1, xmm1
 4f3:   eb 0a   jmp 10 <_vm_run+0x46F>
 4f5:   b8 01 00 00 00  mov eax, 1
 4fa:   66 48 0f 6e c8  movq    xmm1, rax

My question is: why doesn't clang promote one to a register? (There are some unused.) Is it a question of calling convention? It would have saved quite a few bytes (a move between xmm registers is only 4 or 5 bytes).

EDIT:

Here's a reproducible example: https://godbolt.org/z/qfTZqY

elmattic
  • What compiler flags are you using? – tadman Oct 15 '18 at 21:59
  • Compile options are -mssse3 -O3 -fno-builtin – elmattic Oct 15 '18 at 22:00
  • Could you write some assembly showing the kind of transformation you would like clang to do? – EOF Oct 15 '18 at 22:01
  • @EOF: basically the last two lines, but where xmm1 would be another xmm register, so this would be done once and for all. This code is part of a stack-based vm. – elmattic Oct 15 '18 at 22:07
  • @Stringer So the code will be executed in a loop? That would seem like a fairly important detail for this question. – EOF Oct 15 '18 at 22:08
  • @EOF: yes, it's in a loop with computed gotos, so an indirect branch; maybe that's the reason? – elmattic Oct 15 '18 at 22:10 [a minimal computed-goto sketch appears at the end of this comment thread]
  • What's the surrounding context? In a loop? Can you give a [mcve] of the possible missed-optimization? And BTW, I think this would be more efficient with cmp / `sete al` / `movd xmm1, eax`, unless it predicts very well. If it does, speculative exec removes the data dependency. Another option would be horizontal SIMD, like `psadbw` / `pcmpeqq` and then shuffle high to low / `pand`. (You might need setup for `psadbw` depending on whether your source vector has all bits in each byte set the same. Or you could use `pcmpgtd` on the psadbw result because only MSB-set in all can be high enough.) – Peter Cordes Oct 15 '18 at 22:11 [a branchless sketch along these lines appears after this comment thread]
  • @PeterCordes, thanks, yes I will try to extract some minimal sample. – elmattic Oct 15 '18 at 22:14
  • Outside a loop, https://godbolt.org/z/ntJS_Q gcc loads the constant from memory while clang does what you show. I also included a branchless version that compiles reasonably for CPUs other than AMD Bulldozer-family (where xmm->integer->xmm is high latency). – Peter Cordes Oct 15 '18 at 22:20
  • @PeterCordes, first, thank you for all your tips, they were really helpful! Second, sorry for the long delay. I've added a small sample, but as you said, you already understand my problem: clang does not promote constants (like 2, 1 or 0 in my example) to xmm registers (gcc is able to do that, as you said). I'm looking for a workaround with clang since I can't afford a compiler switch now. Thanks! – elmattic Oct 28 '18 at 09:05
  • @PeterCordes, apparently a while loop with a switch inside seems to do what I want. What do you think? https://godbolt.org/z/xz4KTg – elmattic Oct 28 '18 at 09:38
  • @PeterCordes, my only complaint is that in the infloop/switch version clang does a `cmp` and `ja` just before the indirect branch. – elmattic Oct 28 '18 at 14:20
  • Branch prediction may work less well with a single common indirect branch, vs. duplication of the dispatch into each block. Modern TAGE branch predictors are very powerful though, and can use recent branch history when indexing, so they do actually make it ok to have a single dispatch branch. [X86 prefetching optimizations: "computed goto" threaded code](https://stackoverflow.com/q/46321531). Getting gcc to optimize away the range check in a switch is a separate problem; [masking like `opcode&3` may be cheaper](/q/3250178), but doesn't solve the problem of extra `jmp` instructions. – Peter Cordes Oct 28 '18 at 14:56
  • More importantly, the code-gen for `cmp2` inlined into your interp loop in the godbolt links in your comment vs. the question looks basically the same. I'm not sure why you think one is better. Also, if you're only using SIMD for `_mm_subs_epi8(acc, two)`, you'd probably gain speed from doing that with scalar code. Or at least simplify `cmp` because nothing can make the high 64 bits of `acc` non-zero. So just compare and do `0-cmpeq_epi64()`, if you can use SSE4.1 `pcmpeqq`. Or use a 16 or 32-bit compare because your add and sub can only modify the low 16 bits. – Peter Cordes Oct 28 '18 at 15:03
  • Oh, I just noticed the `mov eax, 1` / `movd xmm1, eax` in the `inc` and `dec` blocks. Yeah that's odd. But with the switch and goto, clang doesn't know it's ever used. Have you tried using profile-guided optimization so the compiler will know that it's actually hot? https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization – Peter Cordes Oct 28 '18 at 15:09
  • @PeterCordes, the real SIMD code is far more complicated (several factors faster than ALU code). PGO didn't help. My real issue is all the code that is not promoted to registers; constantly `pxor`ing and `pcmpeq`ing adds bloat. – elmattic Oct 28 '18 at 15:21
  • Yes, I would rather use "computed goto" since I think it predicts better. The issue is that if I switch now to the infloop/switch, with all the register allocation that is going to take place, it will be hard to compare performance fairly. – elmattic Oct 28 '18 at 15:25
  • _Oh, I just noticed the mov eax, 1 / movd xmm1, eax in the inc and dec blocks. Yeah that's odd._ Exactly that's my main issue with clang. – elmattic Oct 28 '18 at 15:34
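
A branchless version along the lines of Peter Cordes's cmp / `sete al` / `movd xmm1, eax` suggestion might look like this (a sketch reusing the hypothetical names from the snippet above, not the code from the question):

#include <immintrin.h>

/* (mask == 0xffff) evaluates to 0 or 1 as an int, which the compiler
   can materialize with cmp + sete; _mm_cvtsi32_si128 zero-extends it
   into an xmm register (movd), so the result equals
   _mm_set_epi64x(0, 1) or all-zero, with no branch at all. */
static __m128i pick_branchless(__m128i a, __m128i b)
{
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
    return _mm_cvtsi32_si128(mask == 0xffff);
}

As the comments note, the xmm->integer->xmm round trip is high latency on AMD Bulldozer-family CPUs, and a well-predicted branch can still win because speculative execution removes the data dependency.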

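For readers unfamiliar with the dispatch style discussed in the comments, here is a minimal computed-goto interpreter loop (hypothetical opcodes; uses the GCC/clang labels-as-values extension; a sketch of the technique only, not the asker's vm):

/* Each handler ends with its own indirect jump, duplicating the
   dispatch branch per opcode; this can help branch prediction
   compared to a single shared switch dispatch. */
int run(const unsigned char *code)
{
    static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;
    goto *dispatch[*code++];
op_inc:  acc++; goto *dispatch[*code++];
op_dec:  acc--; goto *dispatch[*code++];
op_halt: return acc;
}
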
0 Answers