I've just learned how to optimize GEMM with x86 vector registers. We were given matrices whose entries are 32-bit ints, and we can ignore overflow for simplicity.
There's _mm256_fmadd_pd for double-precision floating-point numbers to compute C = AB + C, but for integers there seems to be no such FMA instruction. I first tried _mm256_mullo_epi32, which keeps only the low 32 bits of each product (so overflow is ignored), and then _mm256_add_epi32 to accumulate, like this:
#include <immintrin.h>
__m256i alpha = ...; // load something from memory
__m256i beta  = ...; // load something, too
gamma = _mm256_add_epi32(gamma, _mm256_mullo_epi32(alpha, beta));
// for double variables: gamma = _mm256_fmadd_pd(alpha, beta, gamma);
_mm256_storeu_epi32(..some place, gamma);
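For concreteness, here is a minimal self-contained sketch of the mullo + add approach described above. The function names (madd_epi32, madd_epi32_avx2, madd_epi32_scalar) and the flat-array layout are hypothetical, chosen only for illustration; the snippet assumes GCC or Clang on x86-64, and it falls back to scalar code when AVX2 is not available at run time. It is not a tuned GEMM kernel, just the inner multiply-accumulate step.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar fallback: c[i] += a[i] * b[i], wrapping on overflow.
   Unsigned arithmetic is used so the wraparound is well defined. */
static void madd_epi32_scalar(int32_t *c, const int32_t *a,
                              const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = (int32_t)((uint32_t)c[i] + (uint32_t)a[i] * (uint32_t)b[i]);
}

/* AVX2 version: 8 lanes of 32-bit integers per iteration.
   The target attribute lets this compile without passing -mavx2. */
__attribute__((target("avx2")))
static void madd_epi32_avx2(int32_t *c, const int32_t *a,
                            const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i alpha = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i beta  = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i gamma = _mm256_loadu_si256((const __m256i *)(c + i));
        /* Low 32 bits of each product, then a wrapping 32-bit add. */
        gamma = _mm256_add_epi32(gamma, _mm256_mullo_epi32(alpha, beta));
        _mm256_storeu_si256((__m256i *)(c + i), gamma);
    }
    /* Handle the remaining (n mod 8) elements. */
    madd_epi32_scalar(c + i, a + i, b + i, n - i);
}

/* Hypothetical entry point: picks the AVX2 path when the CPU supports it. */
void madd_epi32(int32_t *c, const int32_t *a, const int32_t *b, size_t n)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        madd_epi32_avx2(c, a, b, n);
    else
        madd_epi32_scalar(c, a, b, n);
}
```

The sketch stores with _mm256_storeu_si256 rather than _mm256_storeu_epi32: as far as I understand, the latter is an AVX-512VL intrinsic and may be missing from the headers of older GCC releases, while the former needs only AVX.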
The lab server has a Cascade Lake Xeon(R) Gold 6226R, with GCC 7.5.0.
The Intel Intrinsics Guide tells me that mullo costs more CPI than mul (nearly twice as much, with much higher latency), which surely affects performance. Are there any integer FMA instructions, or a better implementation, for this case?
