Yes, you eventually want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values from a 128-bit load and then extract.
In asm you just want a MOV load and a memory-source OR or ADD, which will set ZF just like you're doing now. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / por to set up for a single movq.
In C++, use memcpy to do strict-aliasing safe loads of uint64_t tmp vars, then if(a | b). This is still SIMD, just SWAR (SIMD Within A Register).
ADD is even better than OR: it can macro-fuse with most JCC instructions on Intel Sandybridge-family (but not AMD). OR can't fuse with branch instructions on any CPUs. Since your values are 0 or 1, two non-zero values can never add up to zero: no byte sum exceeds 2, so nothing carries or wraps. That wraparound risk is why you'd normally use OR in the general case.
(Some addressing modes may defeat micro or macro-fusion on Intel. Or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 (load + add/sub-and-branch). Assuming it's about the same as cmp on my Skylake, except it does write the destination so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)
uint64_t a, b;
memcpy(&a, noise_frame_flags+0, sizeof(a)); // strict-aliasing-safe loads
memcpy(&b, noise_frame_flags+8, sizeof(b)); // which optimize to MOV qword
bool isNoiseToCancel = a + b; // equivalent to a | b for bool inputs
This should compile to 3 asm instructions (MOV load, memory-source ADD, JCC) which will decode to 2 fused-domain uops total on Intel, or 3 on AMD CPUs where JCC can only fuse with cmp or test.
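For reference, here is the snippet above wrapped into a self-contained function, as a minimal sketch. The names any_flag_set and any_bit_set_general are hypothetical, and the code assumes the array holds exactly 16 bytes that are each 0 or 1. The second helper is the OR form you'd want if the bytes could hold arbitrary values.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical wrapper name. Assumes flags16 points at 16 bytes, each 0 or 1.
inline bool any_flag_set(const std::uint8_t* flags16) {
    assert(flags16 != nullptr);
    std::uint64_t a, b;
    std::memcpy(&a, flags16 + 0, sizeof(a));  // strict-aliasing-safe 8-byte load
    std::memcpy(&b, flags16 + 8, sizeof(b));
    // ADD is only equivalent to OR here because each byte sum is at most 2,
    // so the 64-bit sum cannot carry out or wrap to zero.
    return (a + b) != 0;
}

// General-purpose form: OR never loses set bits, whatever the inputs are.
inline bool any_bit_set_general(std::uint64_t a, std::uint64_t b) {
    return (a | b) != 0;
}
```

The general case really does differ: with a = 1 and b = 0xFFFFFFFFFFFFFFFF the sum wraps to 0 even though both inputs are non-zero, while the OR is still non-zero.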
union { alignas(16) uint8_t flags[16]; uint64_t chunks[2];}; would be safe in C99, but not ISO C++. Most but not all C++ compilers that support Intel intrinsics define the behaviour of union type-punning. (I think @jww has said SunCC doesn't.)
In C++11, you don't need a custom macro for ALIGNTO(16), just use alignas(16). Also supported in C11 if you #include <stdalign.h>
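A minimal sketch of the alignas(16) declaration, assuming C++11 or later. The struct name NoiseFlags and both helper functions are hypothetical, only there to show that the alignment is something you can check at compile time and at run time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type name: 16 flag bytes, aligned so a 16-byte aligned load is legal.
struct NoiseFlags {
    alignas(16) std::uint8_t flags[16];
};

static_assert(alignof(NoiseFlags) == 16, "alignas(16) on a member raises the struct's alignment");
static_assert(sizeof(NoiseFlags) == 16, "no padding expected for a lone 16-byte array");

// Run-time check of the same property for an arbitrary pointer.
inline bool is_16_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical accessor that asserts the alignment before handing out the pointer.
inline const std::uint8_t* checked_flags(const NoiseFlags& f) {
    assert(is_16_byte_aligned(f.flags));
    return f.flags;
}
```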
Alternatives:
movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz - 4 uops on Intel CPUs, 3 on AMD.
Intel runs ptest as 2 uops, and it can't macro-fuse with jcc.
AMD CPUs run ptest as 1 uop, but it still can't fuse.
If you had an all-ones or all-zeros constant in a register, ptest xmm0, [mem] would work to save a uop on Intel (depending on addressing mode), but that's still 3 total.
PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Surprisingly, vptest ymm only requires AVX1). Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, 7 / vpmovmskb eax,ymm0 / test+jnz. See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM depending on how good their horizontal OR support is.
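For comparison, the 16-byte ptest alternative can be written with the SSE4.1 intrinsic _mm_testz_si128. This is only a sketch: the function name any_flag_set_ptest is hypothetical, it uses an unaligned load so it doesn't depend on how the array was declared, and it falls back to the two-load scalar version when the compiler isn't targeting SSE4.1 (e.g. no -msse4.1 or -march= option).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1: _mm_testz_si128
#endif

// Hypothetical helper name. Assumes flags16 points at 16 readable bytes.
inline bool any_flag_set_ptest(const std::uint8_t* flags16) {
    assert(flags16 != nullptr);
#if defined(__SSE4_1__)
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(flags16));
    // _mm_testz_si128(v, v) returns 1 when (v & v) == 0, i.e. when all 128 bits are zero.
    return !_mm_testz_si128(v, v);
#else
    // Fallback when SSE4.1 isn't enabled at compile time: two 8-byte loads and an OR.
    std::uint64_t a, b;
    std::memcpy(&a, flags16 + 0, sizeof(a));
    std::memcpy(&b, flags16 + 8, sizeof(b));
    return (a | b) != 0;
#endif
}
```

Both paths give the same answer for any input bytes, so this version doesn't depend on the flags being 0 or 1.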
popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set.
In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily, probably more efficient than _mm_sad_epu8 against a zeroed register to sum into two 8-byte halves.
__m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
vflags = _mm_slli_epi32(vflags, 7);
unsigned flagmask = _mm_movemask_epi8(vflags);
if (flagmask) {
    unsigned flagcount = __builtin_popcount(flagmask); // popcnt with -march=nehalem or higher
    unsigned first_setflag = __builtin_ctz(flagmask); // tzcnt if available, else BSF
    flagmask &= flagmask - 1; // clear lowest set bit. blsr if compiled with -march=haswell or bdver2 or newer.
    ...
}
(Don't actually use -march=bdver2 or -march=nehalem, unless you want to set an ISA baseline but also use -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but generally good to enable all ISA extensions that some CPU supports, so you don't miss out on useful stuff the compiler can use.)
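Putting the bitmap idea together, here's a sketch of scanning all the set flags, assuming GCC or clang (for __builtin_ctz) and flag bytes that are each 0 or 1. The helper names flags_to_bitmap and set_flag_indices are hypothetical, and the scalar fallback is only there so the sketch still builds when SSE2 isn't available.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2: _mm_slli_epi32, _mm_movemask_epi8
#endif

// Hypothetical helper: bit i of the result is set iff flags16[i] == 1.
inline unsigned flags_to_bitmap(const std::uint8_t* flags16) {
    assert(flags16 != nullptr);
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(flags16));
    v = _mm_slli_epi32(v, 7);  // move bit 0 of each byte up to bit 7
    return static_cast<unsigned>(_mm_movemask_epi8(v));  // gather the 16 top bits
#else
    unsigned mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= static_cast<unsigned>(flags16[i] & 1u) << i;
    return mask;
#endif
}

// Hypothetical helper: indices of the set bits, lowest first.
inline std::vector<int> set_flag_indices(unsigned mask) {
    std::vector<int> out;
    while (mask != 0) {
        out.push_back(__builtin_ctz(mask));  // index of the lowest set bit
        mask &= mask - 1;                    // clear the lowest set bit
    }
    return out;
}
```

The loop runs once per set flag rather than once per array element, which is the point of building the bitmap first.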