I've noticed several instances of clang disregarding the instructions documented for masked AVX-512 intrinsics and substituting slower instruction sequences. This really undermines the expectation of programmer control. Otherwise, why bother using intrinsics?
Here's an egregious example I've encountered (godbolt), where clang's output ran 3x slower than gcc's. Expecting this:
avx512_low_insert:
        vptestnmq       %zmm0, %zmm0, %k0
        movl    $1, %eax
        kmovb   %eax, %k2
        knotb   %k0, %k1
        kaddb   %k2, %k1, %k1
        kandb   %k1, %k0, %k1
        vpbroadcastq    %rdi, %zmm0 {%k1}
we instead obtain (with clang 16.x, the current release at the time of writing) the much more expensive:
avx512_low_insert:
        vptestmq        %zmm0, %zmm0, %k0
        movb    $1, %al
        kmovd   %eax, %k1
        kaddb   %k1, %k0, %k1
        vptestnmq       %zmm0, %zmm0, %k1 {%k1}
        vpbroadcastq    %rdi, %zmm0 {%k1}
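To make the comparison concrete, the mask arithmetic in the two listings can be modelled in scalar C. This is only a sketch for illustration, not the source behind the godbolt link: the function names `mask_expected`, `mask_clang` and `check_equivalence` are made up, and `nonzero` stands for an 8-bit mask with bit i set when 64-bit lane i of the input is non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the first listing:
 * vptestnmq, knotb, kaddb, kandb. */
static inline uint8_t mask_expected(uint8_t nonzero) {
    uint8_t is_zero = (uint8_t)~nonzero;   /* vptestnmq: lanes equal to zero  */
    uint8_t t = (uint8_t)~is_zero;         /* knotb                           */
    t = (uint8_t)(t + 1u);                 /* kaddb with a mask holding 1     */
    return (uint8_t)(t & is_zero);         /* kandb: lowest set bit of is_zero */
}

/* Hypothetical scalar model of the second listing:
 * vptestmq, kaddb, then vptestnmq under a write-mask. */
static inline uint8_t mask_clang(uint8_t nonzero) {
    uint8_t t = (uint8_t)(nonzero + 1u);   /* vptestmq result, then kaddb with 1 */
    uint8_t is_zero = (uint8_t)~nonzero;   /* second vector test: vptestnmq      */
    return (uint8_t)(is_zero & t);         /* {%k1} zeroes bits not set in t     */
}

/* Exhaustively compare the two models over all 256 possible masks. */
static inline void check_equivalence(void) {
    for (unsigned m = 0; m < 256; ++m) {
        assert(mask_expected((uint8_t)m) == mask_clang((uint8_t)m));
    }
}
```

Both sequences select the same lane (the lowest one that is zero), so the result is unchanged. The difference is that the second listing folds the `knotb`/`kandb` pair into a second, masked `vptestnmq` instead of staying in the mask registers.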
Clang is essentially disregarding the intrinsics specified and substituting its own, inferior, ideas.
Short of hand-rolling inline asm, is there any way I can persuade it otherwise?