NEON can quickly narrow a 128-bit comparison byte mask to 64 bits,
using either 'shift right and narrow' (SHRN) or 'signed saturating extract narrow' (SQXTN/VQMOVN).
This allows the mask to be extracted to a 64-bit general purpose register on __aarch64__.
Once extracted, the mask can be checked for all-zeros or all-ones (-1).
The first match can be found with __builtin_ctzll(m) (an rbit + clz pair on AArch64).
All matches can be enumerated by clearing the redundant bits, then stepping with m &= m - 1; and so on.
See Danila Kutenin's blog post Porting x86 vector bitmask optimizations to Arm NEON.
// find first match
uint8x16_t byte_mask = vceqq_u8(input, vdupq_n_u8('A'));
uint8x8_t nibble_mask = vshrn_n_u16(vreinterpretq_u16_u8(byte_mask), 4);
uint64_t m = vget_lane_u64(vreinterpret_u64_u8(nibble_mask), 0);
if (m != 0) return __builtin_ctzll(m) >> 2; // each byte maps to a nibble, so >> 2 gives the byte index
If an actual bitmap is needed:
Scalar multiplication can be used to extract bits from each 64-bit word of the comparison mask.
See Arseny Kapoulkine's blog post VPEXPANDB on NEON with Z3.
Similar to 'Extracting bits with a single multiplication', but optimized for bytes that are either 0xFF or 0x00.
uint64_t NEON_i8x16_MatchMask (const uint8_t* ptr, uint8_t match_byte) {
    uint8x16_t cmpMask = vceqq_u8(vld1q_u8(ptr), vdupq_n_u8(match_byte));
    // extract each 64-bit lane into a 64-bit general purpose register
    uint64_t hi = vgetq_lane_u64(vreinterpretq_u64_u8(cmpMask), 1);
    uint64_t lo = vgetq_lane_u64(vreinterpretq_u64_u8(cmpMask), 0);
    // extract bits
    const uint64_t magic = 0x000103070f1f3f80ull;
    hi = (hi * magic) >> 56;
    lo = (lo * magic) >> 56;
    return (hi << 8) + lo;
}
Processing 64+ bytes in bulk may allow for a more efficient implementation of the 'movemask' operation.
See Geoff Langdale's blog post Fitting My Head Through The ARM Holes.
Similar to the question 'ARM NEON: Convert a binary 8-bit-per-pixel image (only 0/1) to 1-bit-per-pixel?'.
uint64_t NEON_i8x64_MatchMask (const uint8_t* ptr, uint8_t match_byte)
{
    // load interleaved (slow)
    uint8x16x4_t v = vld4q_u8(ptr); 
    // detect matches
    uint8x16_t tag = vdupq_n_u8(match_byte);
    uint8x16_t v0 = vceqq_u8(v.val[0], tag);
    uint8x16_t v1 = vceqq_u8(v.val[1], tag);
    uint8x16_t v2 = vceqq_u8(v.val[2], tag);
    uint8x16_t v3 = vceqq_u8(v.val[3], tag);
    
    // collect the MSB of 4 bytes vertically into hi-nibble of result byte
    uint8x16_t acc = vsriq_n_u8(vsriq_n_u8(v3, v2, 1), vsriq_n_u8(v1, v0, 1), 2);
    
    // pack the hi-nibbles together
    uint8x8_t r = vshrn_n_u16(vreinterpretq_u16_u8(vsriq_n_u8(acc, acc, 4)), 4);
    return vget_lane_u64(vreinterpret_u64_u8(r), 0);
}