If I have array of 16 or 32 or 64 bytes (let's suppose aligned on 64-bytes memory boundary), how do I quickly find index of first byte equal to given, using SIMD SSE2/AVX/AVX2/AVX-512. If such byte not exist for example you can return index equal to size of array.
uint8_t a[16];
size_t index = FindByte(a, 0x37);
of course without SIMD it can be implemented with simple loop
template <size_t Size>
size_t FindByte(uint8_t const (&arr)[Size], uint8_t b) {
    for (size_t i = 0; i < Size; ++i)
        if (arr[i] == b)
            return i;
    return Size;
}
These above are just toy examples. In real application I have to check not a single array but millions of arrays per second. That's why I need to do this very fast, hence I was thinking of doing this using SIMD instructions. Also multi-threading here will not solve the problem because I have sequential queries to FindByte(), I can't run them in parallel.
I tried to implement following solution with SIMD SSE2 and two precomputed tables:
#include <cstdint>
#include <iostream>
#include <array>
#include <emmintrin.h>
template <size_t Size>
size_t FindByte(uint8_t const (&arr)[Size], uint8_t b) {
    static auto const precompute = []{
        std::array<__m128i, 256> val{};
        for (size_t i = 0; i < 256; ++i)
            val[i] = _mm_set1_epi8(i);
        std::array<uint8_t, 1 << 16> pos{};
        for (size_t i = 0; i < (1 << 16); ++i) {
            size_t j = 0;
            for (j = 0; j < 16; ++j)
                if (i & (1 << j))
                    break;
            pos[i] = j;
        }
        return std::make_pair(std::move(val), std::move(pos));
    }();
    auto const cmp = _mm_cmpeq_epi8(
        _mm_loadu_si128((__m128i*)&arr),
        precompute.first[b]);
    return precompute.second[_mm_movemask_epi8(cmp)];
}
int main() {
    uint8_t arr[16] = {3, 5, 7, 11, 13, 17, 19};
    std::cout << FindByte(arr, 17) << " "
        << FindByte(arr, 23) << std::endl;
}
But this solution above might be non-optimal, because 64 KB table may give L1 cache misses, it only fits into L2 cache.
Also I expect that it can be solved without precomputed tables, just with SIMD instructions alone.
Also I need not only SSE2 solution that I did above, but also AVX/AVX2/AVX512 for the case of 32/64 bytes array.
Update. As question is closed, posting here my final working-fast solution for SSE2 that I figured out with the help of comments of @PeterCordes:
#include <cstdint>
#include <iostream>
#include <bit>
#include <emmintrin.h>
size_t FindByte(uint8_t const * arr16, uint8_t b) {
    return std::countr_zero(uint16_t(
        _mm_movemask_epi8(_mm_cmpeq_epi8(
            _mm_loadu_si128((__m128i*)arr16), _mm_set1_epi8(b)))
    ));
}
int main() {
    uint8_t arr[16] = {3, 5, 7, 11, 13, 17, 19};
    std::cout << FindByte(arr, 17) << " "
        << FindByte(arr, 23) << std::endl;
}
Console output:
5 16
 
    