Do the MMX registers always exist in modern processors?

Question

When I look at diagrams and overviews of recent processors[1], I never see mention of the MMX registers MM0 - MM7. But from the specs, it seems like they still exist. Can one depend on them being present in all processors that support SSE? Do they conflict with anything other than the even older FPU stack? Are they the same physical registers as the general 64-bit ones?

While XMM and YMM are much better for vectors, I occasionally want to use the MMX registers for stashing values that would otherwise spill to the stack. Speedwise this looks a little better, and also there are times when I want to avoid additional stores and loads.

[1] http://www.realworldtech.com/haswell-cpu/

It's generally a good idea to use the information returned from [CPUID](http://en.wikipedia.org/wiki/CPUID) to select an appropriate code path. — Michael, Jun 07 '13 at 09:58

Peter Cordes · Answer 1 · 2019-07-12T01:42:47.477

SSE1 implies MMX, so yes, supporting x86-64 guarantees MMX (because SSE2 is baseline for x86-64).

They alias the 80-bit x87 regs, not the general-purpose integer registers! Long mode doesn't change anything about how MMX works.

All modern CPUs are 64-bit capable and thus have MMX available in all modes. Even 32-bit only embedded AMD Geode CPUs have MMX (but not SSE).

It's pretty rare that MMX is worth using when you have 16x XMM regs + 16x 64-bit GP regs. Store/reload is not terrible, especially if the reload can use a memory source operand.

Extra ALU uops to move data to/from MMX regs are usually not worth it vs. store/reload. Reload can often be micro-fused as a memory source operand, and the ALU execution port pressure can easily be a problem.

If you were doing something special with cache disabled then sure, but normally store-forwarding makes store/reload efficient if you can keep it off the critical path. (It does have ~5 cycle latency).

If you do want to move data between XMM and GP regs, though, typically movd / movq or pinsrd / pextrd are a good choice, not store/reload. I'm saying that a spill/reload of a GP or XMM reg in an outer loop is usually better than 2x movq or movq2dq xmm0, mm0.

In fact on Skylake, one movq2dq costs 2 uops. Same for movdq2q. (movq to/from GP regs is still only 1 uop, though, with the same port 0 or port 5 limitation as transfers between XMM and GP regs).

Plus, using MMX in a function costs you an emms instruction at the end of it (or before any function call if you want to be ABI compliant). The MMX regs are all call-clobbered in normal calling conventions (and in fact the FPU has to be in x87 state instead of MMX state).
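To make that cost concrete, here is a minimal sketch of stashing a 64-bit value in mm0 and getting it back, assuming GCC or Clang extended inline asm on x86-64. The function name `roundtrip_via_mm0` is made up for illustration, and the clobber list is my assumption about how to tell the compiler that `emms` touches the whole x87/MMX state.

```c
#include <stdint.h>

/* Hypothetical helper: park a 64-bit value in mm0, read it back,
 * then leave MMX state so later x87 code still works. */
uint64_t roundtrip_via_mm0(uint64_t v)
{
    uint64_t out;
    __asm__ volatile(
        "movq %1, %%mm0\n\t"   /* GP -> MMX: an ALU uop, not a store */
        "movq %%mm0, %0\n\t"   /* MMX -> GP: another ALU uop */
        "emms"                 /* mark all x87 tags empty again */
        : "=r"(out)
        : "r"(v)
        : "mm0", "st", "st(1)", "st(2)", "st(3)",
          "st(4)", "st(5)", "st(6)", "st(7)");
    return out;
}
```

That is two ALU uops plus an `emms` for what a store and a (possibly micro-fused) reload would otherwise do, which is why it rarely pays off.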

MMX is definitely not as efficient as XMM on modern CPUs. Actually using it for anything other than storage is usually worse than SSE2 (with movq loads/stores and ignoring the high bytes of XMM regs, if you want to work in 64-bit chunks).
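As a rough sketch of what "SSE2 with movq loads/stores" looks like, assuming the standard SSE2 intrinsics from `<emmintrin.h>` (the helper name `add8_bytes_sse2` is just for illustration):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] (wrapping) for 8 bytes,
 * using only the low 64 bits of XMM registers. */
void add8_bytes_sse2(uint8_t dst[8], const uint8_t a[8], const uint8_t b[8])
{
    /* movq xmm, [mem]: loads 8 bytes and zeroes the upper 8 */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);

    /* paddb xmm, xmm: the high 8 bytes are computed too, but ignored */
    __m128i sum = _mm_add_epi8(va, vb);

    /* movq [mem], xmm: stores only the low 8 bytes */
    _mm_storel_epi64((__m128i *)dst, sum);
}
```

No `emms` is needed here, and the x87 state is left alone.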

For example, on Intel/AMD CPUs with mov-elimination for movaps xmm,xmm, MMX register-copy with movq mm1, mm0 still costs an ALU uop and still has 1 cycle of latency. (Both still cost a uop for the front-end; mov-elimination only removes the latency and back-end cost other than the ROB entry.)

Also, Skylake has better throughput for the XMM version of some instructions than for the MMX version. e.g. paddb/w/d/q mm,mm runs on p05, but paddb/w/d/q xmm,xmm runs on p015. Many other operations, like pavg*, pmadd*, and shifts, can run on p01 for XMM regs, but only port 0 for MMX regs. (https://agner.org/optimize/)

So like the x87 FPU, it's still supported for legacy code, but it has fewer execution units that support it. It's not terrible yet, so software like x264 and FFmpeg that still has significant amounts of MMX code for stuff that naturally works in 64-bit chunks doesn't suffer too badly.

128-bit AVX versions of integer instructions would probably be the best bet in many cases to avoid register-copy mov instructions.

Robert Houghton · Answer 2 · 2019-07-12T01:26:39.860

The best "diagrams and overviews" to look at are always in the manual. In this case you'll find lots of information on MMX technology and the subsequent SSE (Streaming SIMD Extensions) starting in Section 5.4 of the Intel Manual; that's p. 122 in the 4-volume set's PDF. To get deeper into programming with MMX, you'll want to start in Section 9.2 (p. 228). Personally I really like Intel's "C++ Compiler for Linux* Intrinsics Reference" to learn more than you may ever need to know about MMX. Here's a copy: https://www.cs.fsu.edu/~engelen/courses/HPC-adv/intref_cls.pdf

Can one depend on them being present in all processors that support SSE?

Yes. SSE means MMX is present. As mentioned in the comments, you'll want to use the CPUID intrinsic to check:

CPUID.01H:EDX.MMX[bit 23] = 1

Or just keep in mind that MMX tech came out in 1997, and I see this question was posted in 2013 and edited in 2014, so...
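A minimal sketch of that check, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid`. The function names are made up for illustration; the bit positions are the CPUID.01H:EDX feature flags (MMX = bit 23, SSE = bit 25, SSE2 = bit 26).

```c
#include <cpuid.h>    /* GCC/Clang wrapper around the CPUID instruction */
#include <stdbool.h>

/* Hypothetical helper: true if CPUID.01H:EDX has the given bit set. */
bool cpuid1_edx_bit(unsigned bit)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;              /* CPUID leaf 1 not available */
    return ((edx >> bit) & 1u) != 0;
}

bool has_mmx(void)  { return cpuid1_edx_bit(23); }
bool has_sse(void)  { return cpuid1_edx_bit(25); }
bool has_sse2(void) { return cpuid1_edx_bit(26); }
```

On any x86-64 machine I'd expect all three to return true, since SSE2 is baseline there.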

Do they conflict with anything other than the even older FPU stack?

No, but that is strange, isn't it? The MMX state is aliased to the x87 FPU state. The reason is to avoid compatibility problems with the context-switch mechanisms in existing operating systems. They differ from the FPU registers in that they are directly addressable rather than accessed as a stack, so maybe that's why you are drawn to them. Plus they were designed to work on packed data types! However, this aliasing makes it difficult to work on x87 floating-point data and MMX SIMD data in the same application.

Are they the same physical registers as the general 64-bit ones?

This question was a little confusing. When you say the general 64-bit ones, you mean the 16 general-purpose registers in an x64 computer, right? Or the eight 80-bit FPU data registers, which operate like a stack? The MMX registers are not the general-purpose registers, but they are NOT separate from the x87 FPU data register stack. The Intel Manual seems to acknowledge how misleading these MMX registers are by saying:

Although MMX registers are defined in the IA-32 architecture as separate registers, they are aliased to the registers in the FPU data register stack (R0 through R7)

-Section 9.2.2, p.229

There are eight 64-bit MMX registers. But as you can tell, there are a lot of registers for you to use! The confusing part is that instructions that save and restore the x87 state also handle the MMX state.

When an MMX instruction (other than the EMMS instruction) is executed, the processor changes the x87 FPU state as follows:
• The TOS (top of stack) value of the x87 FPU status word is set to 0.

• The entire x87 FPU tag word is set to the valid state (00B in all tag fields).

• When an MMX instruction writes to an MMX register, it writes ones (11B) to the exponent part of the corresponding floating-point register (bits 64 through 79).

-Section 9.6.2, p.235 Intel Manual.
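To visualize the three bullet points quoted above, here is a toy C model of the aliasing. It is only a sketch: the struct layout and function names are my own invention, not anything from the manual, and it models nothing beyond the quoted state changes plus the fact that `emms` marks every tag empty (11B).

```c
#include <stdint.h>

/* Simplified model of one 80-bit x87 data register (R0..R7). */
typedef struct {
    uint64_t significand;    /* bits 0..63: the part MMn aliases */
    uint16_t sign_exponent;  /* bits 64..79: sign + 15-bit exponent */
} x87_reg_model;

typedef struct {
    x87_reg_model r[8];
    unsigned top;            /* TOS field of the status word */
    uint16_t tag_word;       /* 2 bits per register: 00B valid, 11B empty */
} fpu_model;

/* Model of "an MMX instruction writes MMn" (MMn aliases physical Rn). */
void model_mmx_write(fpu_model *f, unsigned n, uint64_t value)
{
    f->top = 0;                       /* TOS is set to 0 */
    f->tag_word = 0x0000;             /* every tag becomes valid (00B) */
    f->r[n].significand = value;      /* data lands in bits 0..63 */
    f->r[n].sign_exponent = 0xFFFF;   /* bits 64..79 are set to all ones */
}

/* Model of EMMS: every tag becomes empty (11B). */
void model_emms(fpu_model *f)
{
    f->tag_word = 0xFFFF;
}
```

Read as an 80-bit float, a register written this way has an all-ones exponent, which is why x87 code that runs before `emms` sees garbage rather than a sensible number.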

Maybe it's worth noting that when a value is loaded into these x87 data registers by an x87 instruction, it is automatically converted to double extended-precision floating-point format (p.194 Intel Manual). Just know that when transitioning into MMX mode, the unused FPU bits are set to invalid values, which can cause floating-point instructions to behave strangely.

*Either way, they are separate,* No, MM0..7 *do* alias the x87 regs. They're not separate if you're talking about x87. — Peter Cordes, Jul 11 '19 at 23:31
you are very correct, I apologize, my excuse is Figure 3-1 on p.65 is misleading! I will edit in a correction, thank you. — Robert Houghton, Jul 12 '19 at 00:51
When the manual says "separate", it means separate names. I think their language about aliasing FPU data is pretty clear. (If you know how the x87 stack works, where `st0..7` map onto the underlying R0..7 according to the top-of-stack counter. http://www.ray.masmcode.com/tutorial/fpuchap1.htm has a nice revolver-barrel analogy + diagram). Describing it this way makes sense for the purposes of explaining that legacy fsave / frstor is still sufficient to save/restore, fxsave/fxrstor that saves 32-bit mode XMM state as well isn't needed. i.e. MMX doesn't need OS support. — Peter Cordes, Jul 12 '19 at 01:09
You don't need FSAVE to transition between x87 and MMX. Simply using any MMX instruction will put the FPU in "MMX mode". To transition back to x87 mode (so x87 instructions won't fail), you just need `emms`. You wouldn't normally use `fsave`/`frstor` within a user-space process, just get any valuable data out of the regs before clobbering them. — Peter Cordes, Jul 12 '19 at 01:13
Revolver-barrel is right! j/k but that is a good way to think of the fpu registers function. The misunderstanding is my own, I think it grew from the idea that the MMX registers are referred to as 64 bit, but the fpu data registers are 80 bit. Regardless, thanks for pointing that out. The word alias in reference to mmx -> fpu is ALL over chapter 9 of the Manual. — Robert Houghton, Jul 12 '19 at 01:17
Yeah, like your later manual quote explains, the MMX regs alias the 64-bit significand of the 80-bit FPU regs. (I hadn't known that happened to the high bits. Setting them to all-ones creates a NaN bit-pattern. Or -Infinity if the significand is zero. That avoids a false dependency and maybe makes it easier to debug mistakes by giving NaN instead of subnormals from zero-extension, or just leaving whatever bits were there.) — Peter Cordes, Jul 12 '19 at 01:23

score 0 · Answer 3 · answered Nov 30 '13 at 21:25

MMX support usually isn't listed explicitly. I'd check for SSE support, because SSE support automatically means that MMX is supported.

Simon
