If I want to implement a 128-bit atomic type on x64, can I get away with _mm_store_si128 and _mm_load_si128 for relaxed loads and stores, so as to avoid cmpxchg16b?
(If needed, assume that only load and store are required, although it would be good if I could mix them with cmpxchg16b on the same object.)
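
For concreteness, here is a minimal sketch of what I have in mind. The names `Atomic128`, `store_relaxed` and `load_relaxed` are hypothetical, made up for illustration. The sketch assumes the object is 16-byte aligned and that the compiler emits one aligned 16-byte move for each intrinsic. Whether the hardware then performs that move as a single atomic access is exactly what I am asking.

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128, _mm_set_epi64x
#include <cstdint>

// Hypothetical wrapper type. The aligned SSE2 load/store intrinsics
// require a 16-byte aligned address, hence alignas(16).
struct alignas(16) Atomic128 {
    __m128i value;
};

// Intended "relaxed store": one aligned 16-byte store.
// ASSUMPTION (the point of the question): this is not torn by the CPU.
inline void store_relaxed(Atomic128* a, std::uint64_t lo, std::uint64_t hi) {
    // _mm_set_epi64x takes the high half first, then the low half.
    _mm_store_si128(&a->value,
                    _mm_set_epi64x(static_cast<long long>(hi),
                                   static_cast<long long>(lo)));
}

// Intended "relaxed load": one aligned 16-byte load, then split into halves.
inline void load_relaxed(const Atomic128* a,
                         std::uint64_t* lo, std::uint64_t* hi) {
    __m128i v = _mm_load_si128(&a->value);
    alignas(16) std::uint64_t parts[2];
    _mm_store_si128(reinterpret_cast<__m128i*>(parts), v);
    *lo = parts[0];  // low 64 bits
    *hi = parts[1];  // high 64 bits
}
```

In a single thread this simply round-trips the two halves. The open point is what a concurrent reader may observe while another thread is storing.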