I was inspired by this question and wondered whether it's possible to execute multiple SIMD instructions at the same time, since a CPU core may have multiple vector processing units (page 5 of these slides).
The code is:
#include <algorithm>
#include <cstdlib>   // std::rand
#include <ctime>
#include <iostream>
int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return 0;
}
The generated assembly code: compiled for AVX512 and compiled for AVX2.
After inspecting the assembly, I found that the inner loop (the array traversal) was vectorized. With AVX512 (-march=knl, Knights Landing), each loop iteration processes 64 elements by issuing 8 SIMD additions, each adding 8 elements to a running partial result.
The partial results are kept in 4 zmm registers, each holding 8 elements. At the end, the 4 zmm registers are reduced to a single sum. These SIMD instructions appear to execute serially, because they all use the same zmm5 register to hold the intermediate value.
A piece of the assembly:
# 4 SIMD
vpmovzxdq  zmm5, ymm5          # extends 8 elements from int (32) to long long (64)
vpaddq     zmm1, zmm1, zmm5    # add to the previous result
vpmovzxdq  zmm5, ymm6          # They are using the same zmm5 register
vpaddq     zmm2, zmm2, zmm5    # so I think they are not parallelized
vpmovzxdq  zmm5, ymm7
vpaddq     zmm3, zmm3, zmm5
vpmovzxdq  zmm5, ymm8
vpaddq     zmm4, zmm4, zmm5
# intermediate result stored in zmm1~zmm4
# read additional 32 elements and repeat the above routine once
# in total 8 SIMD and 64 elements in each FOR step after compilation
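To make the accumulation pattern in the listing concrete, here is a minimal scalar C++ sketch of what I understand the vectorized loop to be doing: 4 independent accumulators (standing in for zmm1 to zmm4), each with 8 64-bit lanes, followed by a final reduction. The function name `conditional_sum_model` and the constants are hypothetical, and I am assuming that lanes failing the `>= 128` test are zeroed before the add, which the excerpt above does not show. The sketch handles 32 elements per pass; the compiled loop repeats that pass twice per iteration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative constants (assumptions, not taken from the compiler output).
constexpr std::size_t kLanes = 8;         // 8 x 64-bit lanes in one zmm register
constexpr std::size_t kAccumulators = 4;  // stands in for zmm1..zmm4

// Scalar model of the vectorized conditional sum.
long long conditional_sum_model(const std::vector<int>& data)
{
    // acc[a][l] plays the role of lane l of accumulator register a.
    std::array<std::array<long long, kLanes>, kAccumulators> acc{};
    const std::size_t step = kLanes * kAccumulators;  // 32 elements per pass

    std::size_t i = 0;
    for (; i + step <= data.size(); i += step) {
        for (std::size_t a = 0; a < kAccumulators; ++a) {
            for (std::size_t l = 0; l < kLanes; ++l) {
                const int v = data[i + a * kLanes + l];
                // Widen to 64 bits and add; rejected lanes contribute 0.
                acc[a][l] += (v >= 128) ? v : 0;
            }
        }
    }

    // Final reduction: collapse the 4 x 8 partial sums into one value.
    long long sum = 0;
    for (const auto& reg : acc)
        for (long long lane : reg)
            sum += lane;

    // Scalar tail for sizes that are not a multiple of 32.
    for (; i < data.size(); ++i)
        if (data[i] >= 128)
            sum += data[i];

    return sum;
}
```

In this model the four accumulators never read each other's values, so the only thing the four add pairs share is the temporary that holds the widened input, which is exactly the zmm5 reuse I am asking about.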
My question is: according to Intel, a Knights Landing CPU has 2 vector processing units per core (page 5 of these slides). Would it therefore be possible to execute 2 AVX512 SIMD computations at the same time?
