SSE1 implies MMX, so yes, supporting x86-64 guarantees MMX (because SSE2 is baseline for x86-64).
The MMX registers alias the x87 registers (the low 64 bits of each 80-bit register), not the general-purpose integer registers! Long mode doesn't change anything about how MMX works.
All modern x86 CPUs are 64-bit capable and thus have MMX available in all modes. Even 32-bit-only embedded AMD Geode CPUs have MMX (but not SSE).
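If you'd rather check at run time than rely on the baseline guarantee, a minimal sketch might look like the following. It assumes GCC or Clang (for the `<cpuid.h>` helper `__get_cpuid`); the function name `cpu_has_edx_features` and the `EDX_*` constants are made up for illustration. The bit positions are the documented CPUID leaf 1 EDX feature flags.

```c
#include <cpuid.h>     /* GCC/Clang wrapper around the CPUID instruction */
#include <stdbool.h>

/* Feature flags reported in EDX by CPUID leaf 1 (names are illustrative). */
enum {
    EDX_MMX  = 1 << 23,
    EDX_SSE  = 1 << 25,
    EDX_SSE2 = 1 << 26
};

/* Returns true if every bit in `mask` is set in CPUID.1:EDX. */
static bool cpu_has_edx_features(unsigned int mask)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;              /* leaf 1 not available */
    return (edx & mask) == mask;
}
```

On any x86-64 CPU, `cpu_has_edx_features(EDX_MMX | EDX_SSE | EDX_SSE2)` should be true, which is why 64-bit code normally doesn't bother checking.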
It's pretty rare that MMX is worth using when you have 16x XMM regs + 16x 64-bit GP regs. Store/reload is not terrible, especially if the reload can use a memory source operand.
Extra ALU uops to move data to/from MMX regs are usually not worth it vs. store/reload. The reload can often be micro-fused as a memory source operand, and ALU execution port pressure can easily be a problem.
If you were doing something special with cache disabled then sure, but normally store-forwarding makes store/reload efficient if you can keep it off the critical path (it does have ~5 cycle latency).
If you do want to move data between XMM and GP regs, though, typically movd / movq or pinsrd / pextrd are a good choice, not store/reload. My point is only that a spill/reload of a GP or XMM reg in an outer loop is usually better than parking the value in an MMX register and paying for 2x movq (or movq2dq xmm0, mm0) to get it there and back.
In fact on Skylake, one movq2dq costs 2 uops. Same for movdq2q. (movq to/from GP regs is still only 1 uop, though, with the same port 0 or port 5 limitation as transfers between XMM and GP regs).
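For the XMM <-> GP transfers mentioned above, here is a rough sketch using SSE2 intrinsics. It assumes GCC or Clang targeting x86-64; the helper names are invented for this example. The compiler is free to pick different instructions, but these intrinsics normally map to a single movq each way.

```c
#include <emmintrin.h>   /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>

/* GP -> XMM: normally movq xmm, r64. The upper 64 bits are zeroed. */
static __m128i gp_to_xmm(int64_t x)
{
    return _mm_cvtsi64_si128(x);
}

/* XMM -> GP: normally movq r64, xmm. Returns the low 64 bits. */
static int64_t xmm_to_gp(__m128i v)
{
    return _mm_cvtsi128_si64(v);
}

/* Round trip: move two integers into XMM regs, add there, move back. */
static int64_t add_in_xmm(int64_t a, int64_t b)
{
    __m128i sum = _mm_add_epi64(gp_to_xmm(a), gp_to_xmm(b));
    return xmm_to_gp(sum);
}
```

This is only meant to show the direct register-to-register path; whether it beats a store/reload depends on port pressure and on whether the latency is on the critical path.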
Plus, using MMX in a function costs you an emms instruction at the end of it (or before any function call if you want to be ABI compliant). The MMX regs are all call-clobbered in normal calling conventions (and in fact the ABI requires the FPU to be in x87 state, not MMX state, on function call and return).
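To make the emms cost concrete, here's a small sketch with MMX intrinsics, assuming GCC or Clang on x86-64 (`<mmintrin.h>`; MSVC does not support MMX intrinsics for x64). The function name `add_bytes_mmx` is hypothetical. Note that some newer compilers may implement these intrinsics with SSE2 under the hood, so treat the instruction names in the comments as what you'd typically get, not a guarantee.

```c
#include <mmintrin.h>    /* MMX intrinsics */
#include <stdint.h>

/* Adds eight packed bytes with wraparound, using MMX registers. */
static uint64_t add_bytes_mmx(uint64_t a, uint64_t b)
{
    __m64 va = _mm_cvtsi64_m64((long long)a);    /* movq mm, r64 */
    __m64 vb = _mm_cvtsi64_m64((long long)b);
    __m64 sum = _mm_add_pi8(va, vb);             /* paddb mm, mm */
    uint64_t r = (uint64_t)_mm_cvtm64_si64(sum); /* movq r64, mm */

    /* emms: put the FPU back in x87 state before returning,
       as the calling convention expects. */
    _mm_empty();
    return r;
}
```

For example, `add_bytes_mmx(0x01FF, 0x0101)` gives `0x0200`: the low byte wraps from 0xFF to 0x00 without carrying into the next byte.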
MMX is definitely not as efficient as XMM on modern CPUs. Actually using it for anything other than storage is usually worse than SSE2 (with movq loads/stores and ignoring the high bytes of XMM regs, if you want to work in 64-bit chunks).
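Working in 64-bit chunks with SSE2 instead of MMX could look something like this. It's a sketch assuming GCC or Clang on x86-64; `add8_bytes_sse2` is an illustrative name, not from any library. The load/store intrinsics normally compile to movq, and the upper 8 bytes of each XMM register are simply ignored.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = a[i] + b[i] (wrapping) for exactly 8 bytes. */
static void add8_bytes_sse2(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    /* movq load: reads 8 bytes, zeroes the upper half of the register */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);

    __m128i sum = _mm_add_epi8(va, vb);          /* paddb xmm, xmm */

    /* movq store: writes only the low 8 bytes */
    _mm_storel_epi64((__m128i *)dst, sum);
}
```

No emms is needed here, and the pointers don't have to be 16-byte aligned since only 8 bytes are touched.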
For example, on Intel/AMD CPUs with mov-elimination for movaps xmm,xmm, MMX register-copy with movq mm1, mm0 still costs an ALU uop and still has 1 cycle of latency. (Both still cost a uop for the front-end; mov-elimination only removes the latency and back-end cost other than the ROB entry.)
Also, Skylake has better throughput for the XMM version of some instructions than for the MMX version. e.g. paddb/w/d/q mm,mm runs on p05, but paddb/w/d/q xmm,xmm runs on p015. Many other operations, like pavg*, pmadd*, and shifts, can run on p01 for XMM regs, but only port 0 for MMX regs. (https://agner.org/optimize/)
So like the x87 FPU, MMX is still supported for legacy code, but fewer execution units support it. It's not terrible yet, so software like x264 and FFmpeg that still has significant amounts of MMX code for stuff that naturally works in 64-bit chunks doesn't suffer too badly.
The 128-bit AVX (VEX-encoded) versions of integer instructions would probably be the best bet in many cases, because their non-destructive 3-operand form avoids register-copy mov instructions.