If I have a __m256i vector containing 32 unsigned 8-bit integers, how can I most efficiently unpack and cast that so I get four __m256 vectors, each containing eight 32-bit float numbers?
I suppose that, once I have them in 32-bit signed integer form, I can cast them to floats via _mm256_cvtepi32_ps, so the question probably boils down to how to go most efficiently from the 8-bit unsigned integer (epu8) representation to the 32-bit signed integer (epi32) representation.
There exists _mm256_cvtepu8_epi32(__m128i a), but that only converts the lower 64 bits (8 bytes) of a __m128i input, whereas I have a __m256i input.
Is there a better way than the following: turn my __m256i input into two __m128i vectors via two calls to _mm256_extracti128_si256(__m256i a, const int imm8), then somehow swap the upper and lower 64-bit halves of each of those (for a total of four __m128i vectors, each of which has a different 64-bit quarter of the initial __m256i vector in its bottom half), and finally apply _mm256_cvtepu8_epi32(__m128i a) followed by _mm256_cvtepi32_ps(__m256i a) to each of them?
Seems pretty messy and I'm wondering if there's a better way. I'm entirely new to vector intrinsics so I'm surely missing something here.
EDIT for more context:
So the setup is that I have three pairs of arrays, R1, G1, B1 and R2, G2, B2, of uint8_t pixel values, and the computation to be done is the sum of channel-wise squared differences, i.e. square(R1 - R2) + square(G1 - G2) + square(B1 - B2). The differences are currently computed vectorised in uint8_t form as max(R1, R2) - min(R1, R2) (etc.), so that 32 uint8_t absolute differences can be computed at a time with a single _mm256_sub_epi8. My question kicks in after I've obtained these differences R_diff, G_diff and B_diff, and before squaring them, for which 8-bit integers are too small.