
I want to zero all YMM registers like this:

#include <immintrin.h>

void fn(float *out) {
    register __m256 r0;
    _mm256_zeroall();
    _mm256_storeu_ps(out, r0);
}

But both GCC and Clang give me a warning:

warning: 'r0' is used uninitialized [-Wuninitialized]

Using _mm256_setzero_ps() works, but both the code and the generated assembly are ugly.
If I have 12 such register variables, GCC is likely to generate 12 vmovaps instructions and Clang 12 vxorps instructions. In the worst case, GCC generates a memset call plus many vmovaps.
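For reference, the _mm256_setzero_ps() version I'm trying to avoid looks roughly like this (a minimal sketch with 4 accumulators instead of 12; `fn_setzero` is just an illustrative name):

```c
#include <immintrin.h>

/* Sketch of the _mm256_setzero_ps() approach: each zeroed variable may
   end up as its own vxorps/vmovaps in the generated code. The target
   attribute lets this compile without a global -mavx flag on GCC/Clang. */
__attribute__((target("avx")))
void fn_setzero(float *out) {
    __m256 r0 = _mm256_setzero_ps();
    __m256 r1 = _mm256_setzero_ps();
    __m256 r2 = _mm256_setzero_ps();
    __m256 r3 = _mm256_setzero_ps();
    _mm256_storeu_ps(out +  0, r0);
    _mm256_storeu_ps(out +  8, r1);
    _mm256_storeu_ps(out + 16, r2);
    _mm256_storeu_ps(out + 24, r3);
}
```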
I just want a single vzeroall instruction.

Is there any way to let the compiler know that _mm256_zeroall() zeros all registers, without hand-writing asm?

Edit 1: In fact I'm writing a matrix-product program, which needs to zero many registers at the beginning. To keep the question simple, I used the smallest code that reproduces the problem.

I've confirmed that vzeroall is not slow compared to many vmovaps/vxorps on Zen 3, and vzeroall has a smaller code size, which is more cache friendly.

Removing the register qualifier doesn't work on GCC/Clang; it generates the same assembly as before.

I've found that I can specify the register name on GCC to eliminate the warning, like this:

register __m256 r0 asm("ymm0");

But Clang doesn't honor the declaration and still generates the same warning.

Zz Tux
    Have you tried something like `__m256 r0 = _mm256_setzero_ps(); _mm256_storeu_ps(out, r0);`? – David Wohlferd Mar 07 '23 at 03:09
  • @DavidWohlferd It doesn't work on GCC and Clang. – Zz Tux Mar 07 '23 at 03:13
    Please, never use the phrase "It doesn't work" as the sole description of a problem. It tells us nothing about what you are seeing. [godbolt](https://godbolt.org/z/5dcovE1jd), seems to think it works (although it may not give you the code you're hoping for). – David Wohlferd Mar 07 '23 at 03:18
    What are you really trying to do where having that many zeroed vector variables is even helpful? I could imagine `vzeroall` as potential setup for a dot product or something where you want multiple vector accumulators all zeroed, but if you're just doing `_mm256_storeu_ps` with the zeros then why not store the same vector repeatedly? – Peter Cordes Mar 07 '23 at 03:28
    Also, have you looked at its performance on https://uops.info/? On Haswell `vzeroall` is 20 uops for the front-end, which is more than `vxorps xmm,xmm,xmm` of each register separately. It gets worse on later CPUs, like 29 uops for the front-end (and 9 back-end) on Ice Lake. Alder Lake E-cores run it as a single uop, though. It's not fast on AMD either; 18 uops. So the only way it's useful is when optimizing for Alder Lake E-cores, or for code-size. – Peter Cordes Mar 07 '23 at 03:31
  • @PeterCordes In fact I'm writing a matrix-product program. To simplify the question, I used the simplest code possible. It's worth using `vzeroall` on Zen 3 and other newer architectures because [Agner Fog's Instruction Tables](https://www.agner.org/optimize/instruction_tables.pdf) say its `Reciprocal throughput` is not large compared to a bunch of `vxorps`/`vmovaps`, even though it has more uops. I tested it on my own too and got the same result. I have so many kernels and the redundant code can increase CPU cache pressure, which is why I would like to use `vzeroall`. – Zz Tux Mar 07 '23 at 04:12
    Agner Fog's table agrees with https://uops.info/ - 18 uops, 6 cycle throughput in 64-bit mode on Zen 3. (Less in 32-bit mode, but only 8 vector regs are accessible in that mode.) `vxorps` is 1 uop with 0.25c throughput, so zeroing all 16 registers would take only 16 uops, and have 4 cycle throughput. `vmovaps` is 1 uop with no back-end port needed (with mov-elimination), 0.17c (1/6) throughput limited only by the front-end. Are you perhaps looking at `vzeroupper` performance? It's much faster. – Peter Cordes Mar 07 '23 at 05:05
    Using less space in L1i, and only one "line" of the uop cache, is a valid benefit, though. It's not *that* slow vs. separate instructions if you were going to zero 8 or more vectors, and the reciprocal throughput (which works out to only 3 uops / clock, maybe limited by front-end / microcode stuff) matters less than the total uop count, since it's going to overlap with surrounding code, not repeated runs of `vzeroall`. 8x 4-byte `vxorps` instructions would be more L1i space, but still potentially pack into one "line" of AMD's uop cache, IIRC they do up to 8 uops per line, vs. Intel's 6. – Peter Cordes Mar 07 '23 at 05:10
    Using `vzeroall` to zero multiple `__m256` variables is something the compiler should do on its own if worth it, like if you compiled with `-Os` (optimize for size and speed). Compilers probably don't consider this, but if you wanted to file a missed-optimization bug, this would probably be where you'd want them to look for it. I doubt compiler devs would want to give `_mm256_zeroall()` a side-effect on any/all live vector variables, or even on ones with the mostly-deprecated `register` keyword. – Peter Cordes Mar 07 '23 at 05:21
    Since you mentioned matrix products: I assume you use the registers as accumulators, correct? In that case consider peeling off the first iteration of your loop so that you can initialize the registers with the first multiplication instead of starting straight with a multiply-add – Homer512 Mar 07 '23 at 07:16
  • The Intel compiler seems to do what you want via the `_mm256_undefined_ps()` intrinsic. See [godbolt](https://godbolt.org/z/M75xdMb4e). In that example it splits the store into 2 128-bit stores but probably just to avoid the windup for 256-bit ops since it's not in a loop. – PhantomPilot Mar 07 '23 at 10:56
  • @PhantomPilot: That's perhaps tuning for Sandybridge to avoid unaligned store penalties? Like GCC's old default of `-mavx256-split-unaligned-store`. But unlikely since it's still there with `-mtune=icelake-server`. The fact that it uses `vzeroupper` at the end of this function looks like it's treating it as having dirtied a 256-bit register, even though it didn't. Also, `-march=haswell` or anything makes it remove the store https://godbolt.org/z/sMsr3vscW - just `vzeroall` / `ret`. IDK if that's a bug or intentional optimization based on read-uninitialized. – Peter Cordes Mar 07 '23 at 12:12
  • Reading a YMM upper half with `vextractf128` should be a "light" 256-bit operation at most, as far as [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812). So it only needs L0, can turbo as high as scalar code. (The "warm up" effect for 256-bit ops was never truly that, it was throttling of throughput for SIMD ops if the CPU frequency was above the limit, or voltage wasn't high enough at the same frequency.) – Peter Cordes Mar 07 '23 at 12:13

1 Answer


The answer is that, despite its name, vzeroall only zeroes the first 16 vector registers (ymm0–ymm15) and leaves the others unchanged. As a result, when AVX-512 is enabled the register allocator may choose one of the upper registers (ymm16–ymm31) for your store, which leads to wrong behaviour.

Some more discussion:

Firstly, you are not actually programming in assembly; you are programming in C++ (albeit with x86 intrinsics). If you need a variable multiple times, you just use it multiple times, and the compiler will decide to spill if necessary. Conversely, even if you write multiple _mm256_setzero_ps() calls, the compiler will coalesce them into a single zeroed value.

Secondly, why do you need multiple zero registers? Most AVX instructions are non-destructive, except merge-masking instructions, and merge-masking with a zero source is equivalent to just using zero-masking instead. Since you said it is for multiple accumulators, and the compilers do not perform this loop peeling themselves, you can manually peel the first iteration, which removes the excessive zero initialisations (example).
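A minimal sketch of that peeling idea for a dot-product-style accumulator (`dot` is a hypothetical kernel name; assumes AVX+FMA and that n is a non-zero multiple of 8):

```c
#include <immintrin.h>

/* Peel the first iteration: the accumulator is initialised by the first
   product instead of being zeroed, so no vxorps/vzeroall is needed. */
__attribute__((target("avx,fma")))
float dot(const float *a, const float *b, int n) {
    /* Peeled first iteration: plain multiply instead of zero + FMA. */
    __m256 acc = _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    for (int i = 8; i < n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    /* Horizontal sum of the 8 lanes. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```

With several accumulators (as in a matrix-product kernel), the same trick applies per accumulator: the first FMA of each one becomes a plain multiply.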

Quân Anh Mai
    This is an [avx] question, and only 16 YMM registers are accessible with VEX prefixes (AVX1 and AVX2). But yes good point that even if this was usable in AVX code, it could break when compiling with AVX-512 enabled if the compiler chose to use any of the AVX-512-only YMM16-31 registers, using longer instructions with EVEX prefixes. – Peter Cordes Mar 07 '23 at 04:22