For the reduce-add, do in-lane shuffles and adds (vmovshdup / vaddps, then vpermilps imm8 / vaddps), as in Fastest way to do horizontal float vector sum on x86, to get a horizontal sum in each 128-bit lane, and then vpermps to shuffle the desired elements to the bottom. Or use vcompressps with a constant mask to do the same thing, optionally with a memory destination.
Once packed down to a single vector, you have a normal SIMD 128-bit add.
If your arrays are actually larger than 16, instead of vpermps you could use vpermt2ps to take every 4th element from each of two source vectors, setting you up to do the += part into x[] with 256-bit vectors. (Or combine again with another shuffle into 512-bit vectors, but that will probably bottleneck on shuffle throughput on SKX.)
On SKX, vpermt2ps is only a single uop, with 1c throughput / 3c latency, so it's very efficient for how powerful it is. On KNL it has 2c throughput, worse than vpermps, but maybe still worth it. (KNL doesn't have AVX512VL, but for adding to x[] with 256-bit vectors you (or a compiler) can use AVX1 vaddps ymm if you want.)
See https://agner.org/optimize/ for instruction tables.
For the load:
Is this done inside a loop, or repeatedly? (i.e. can you keep a shuffle-control vector in a register?) If so, you could:
- do a 128->512 broadcast with VBROADCASTF32X4 (a single uop for a load port).
- do an in-lane shuffle with vpermilps zmm,zmm,zmm to broadcast a different element within each 128-bit lane. (This has to be a separate instruction from the broadcast-load, because a memory-source vpermilps can only have an m512 or m32bcst source: instructions typically have their memory-broadcast granularity equal to their element size, which unfortunately isn't useful at all in cases like this. And vpermilps takes the control vector, not the source data, as its memory operand.)
This is slightly better than vpermps zmm,zmm,zmm because the shuffle has 1 cycle latency instead of 3 (on Skylake-avx512).
Even outside a loop, loading a shuffle-control vector might still be your best bet.