According to the documentation, from gcc 4.9 on the AVX-512 instruction set is supported, but I have gcc 4.8. I currently have code like this for summing up a block of memory (it's guaranteed to be less than 256 bytes, so no overflow worries):
__mm128i sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(*(__m128i *) &mem));
Now, looking through the documentation, if we have, say, four bytes left over, I could use:
__mm128i sum = _mm_add_epi16(sum,
                             _mm_mask_cvtepu8_epi16(_mm_set1_epi16(0),
                                                    (__mmask8)_mm_set_epi16(0,0,0,0,1,1,1,1),
                                                    *(__m128i *) &mem));
(Note, the type of __mmask8 doesn't seem to be documented anywhere I can find, so I am guessing...)
However, _mm_mask_cvtepu8_epi16 is an AVX-512 instruction, so is there a way to duplicate this? I tried:
mm_mullo_epi16(_mm_set_epi16(0,0,0,0,1,1,1,1),
               _mm_cvtepu8_epi16(*(__m128i *) &mem));
However, there was a cache stall so just a direct for (int i = 0; i < remaining_bytes; i++) sum += mem[i]; gave better performance.