
On the PS3's PowerPC, transferring between the vector registers and the floating-point registers passed through memory, which could lead to expensive cache misses and so required minimizing unnecessary conversions.

Is this true for other modern architectures? I'm especially curious about mobile, where my understanding is that memory latency is by far the limiting factor.

NOTE: This is for a low-level 3D math library using SSE intrinsics (and eventually other instruction sets), and I'm trying to optimize for memory latency.
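
For concreteness, here is a minimal sketch of the kind of scalar/vector mixing I mean (the `vec4_dot` name is just for illustration, and it assumes SSE4.1 is available for `_mm_dp_ps`):

```c
/* Illustrative only: a 4-component dot product whose scalar result is
 * consumed without an explicit store. The vec4_dot name is made up;
 * the intrinsics are standard SSE / SSE4.1. */
#include <xmmintrin.h>  /* SSE */
#include <smmintrin.h>  /* SSE4.1, for _mm_dp_ps */

static inline float vec4_dot(__m128 a, __m128 b)
{
    /* The dot product stays in an XMM register; _mm_cvtss_f32 just
     * reinterprets the low lane as a scalar float, so on x86 nothing
     * has to pass through memory to get from "vector" to "scalar". */
    __m128 d = _mm_dp_ps(a, b, 0xFF);
    return _mm_cvtss_f32(d);
}
```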

johnb003
  • On x86, scalar FP operations are done in the *same* XMM registers as vector FP operations. The legacy x87 registers / instructions are unused in modern code. Compare https://github.com/HJLebbink/asm-dude/wiki/ADDSS (scalar single) vs. https://github.com/HJLebbink/asm-dude/wiki/ADDPS (packed single). See also http://agner.org/optimize/ – Peter Cordes Mar 16 '18 at 08:38
  • @PeterCordes, long double uses x87 for many compilers and it's used for fractals. I can zoom to 10^-4000 and lower using long double and perturbation. Double-double would be useless because it does not extend the exponent range, and any software implementation that does extend it (e.g. quad precision) can't compete with long double on x87. – Z boson Mar 16 '18 at 10:31
  • @Zboson: Yes, I was over-simplifying by leaving out `long double`, which still uses x87 in the x86-64 System V ABI. (I think Windows x86-64 uses 64-bit `long double` with SSE2, i.e. the same as `double`.) But you can't operate on 80-bit floats using SSE2 or any other x86 SIMD, so the use-cases for x87 <-> XMM are limited mostly to 32-bit code with legacy calling conventions where FP values are returned in x87 registers. If you need `long double` precision, you have to avoid `fst qword [rsp]` to convert to `double`. – Peter Cordes Mar 17 '18 at 00:24
  • Related: [Intel's intrinsics don't provide a good way to turn scalars into vectors without making compilers waste an instruction zero-extending them](https://stackoverflow.com/questions/39318496/how-to-merge-a-scalar-into-a-vector-without-the-compiler-wasting-an-instruction). clang can optimize that out, though, if the upper elements are actually unused. – Peter Cordes Mar 17 '18 at 00:27
  • So it sounds like for x86, this would be a cheap instruction, and not have a mem hit. What about ARM? – johnb003 Mar 26 '18 at 17:06
  • ARM and AArch64 also use the same registers for scalar FP and for NEON SIMD. You forgot to ping me with @PeterCordes, so I only just happened to see your reply when searching for something else. – Peter Cordes Apr 26 '18 at 01:11
  • @PeterCordes If you'd like to summarize your comments in an answer, I think all of the info I was after is covered. – johnb003 Jun 21 '18 at 05:49
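
A short sketch pulling together the comments above, assuming an x86 target with SSE (the `scale_x_only` helper name is made up):

```c
/* Scalar and packed float math share the XMM register file on x86,
 * so mixing them needs no x87 and no store/reload through memory. */
#include <xmmintrin.h>

static inline __m128 scale_x_only(__m128 v, float s)
{
    /* _mm_set_ss puts the scalar into the low lane of an XMM register
     * (compilers may emit an extra instruction to zero the upper lanes,
     * the "wasted instruction" mentioned above; clang can elide it when
     * the upper lanes are unused). _mm_mul_ss then compiles to a scalar
     * MULSS on that same register, so there is no memory round trip. */
    __m128 scale = _mm_set_ss(s);
    return _mm_mul_ss(v, scale);
}
```

Per the comments, ARM and AArch64 behave the same way: scalar FP and NEON SIMD share one register file, so moving values between scalar and vector code does not have to go through memory.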

0 Answers