I have 32 length-1-to-4 strings stored in AVX2 uint8x32 registers, one register for each of length, byte0, byte1, byte2, byte3. I'd like to concatenate all the strings and write them densely to memory. If all the strings were equal length this would be straightforward: I'd shuffle the bytes to their target positions using pshufb and use some blend calls to mix the byte0/byte1/byte2/byte3 registers together. (Alternatively perhaps I could use vpunpck* instructions. Not yet figured out...)
However, the variable-length aspect makes this harder: where each output byte comes from is now a nontrivial function of the lengths. I can't figure out how to implement this efficiently in AVX2 code. Help?
Bottom line: I'd like a reimplementation of the following function, written in (as fast as possible) vector code rather than scalar code:
int concat_strings(char* dst, __m256i len_v, __m256i byte0_v, __m256i byte1_v, __m256i byte2_v, __m256i byte3_v) {
char len[32];
char byte0[32];
char byte1[32];
char byte2[32];
char byte3[32];
_mm256_store_si256(reinterpret_cast<__m256i*>(len), len_v);
_mm256_store_si256(reinterpret_cast<__m256i*>(byte0), byte0_v);
_mm256_store_si256(reinterpret_cast<__m256i*>(byte1), byte1_v);
_mm256_store_si256(reinterpret_cast<__m256i*>(byte2), byte2_v);
_mm256_store_si256(reinterpret_cast<__m256i*>(byte3), byte3_v);
int pos = 0;
for (int i = 0; i < 32; ++i) {
dst[pos + 0] = byte0[i];
dst[pos + 1] = byte1[i];
dst[pos + 2] = byte2[i];
dst[pos + 3] = byte3[i];
pos += len[i];
}
return pos;
}
Help?