An even simpler modification to the approach discussed in comments:
Multiply by 1000 and convert that to integer with fistp (with the default rounding to nearest), instead of just rounding to an integer-valued long double using frndint.
The low 3 decimal digits of that integer are the fractional part of your number.  i.e. you now have decimal fixed-point.  div by 1000 gives you quotient (integer part) and remainder (fractional part).  Print both parts with a . between them.
You'll want to do manual int->string conversion (How do I print an integer in Assembly Level Programming without printf from the c library?) or otherwise print leading zeros in the fractional part.  (So 2.062 doesn't turn into 2.62)
This is easier than separating into integer and fractional parts in FP, which would require rounding with truncation toward zero to make sure you got a non-negative fractional part.  Integer division naturally truncates toward zero, but legacy x87 FP->int conversion can only use the current rounding mode (round-to-nearest by default), except with SSE3 fisttp.  SSE1/2 have had XMM FP->int conversions with either truncation or the current rounding mode since they were introduced, e.g. cvttsd2si vs. cvtsd2si.
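As a rough illustration of the scale-by-1000 idea, here is a minimal C sketch rather than the asm itself. The helper name format_fixed3 is made up for this example. It also assumes that x * 1000 fits in a long, and it rounds by adding 0.5 and truncating, which sends halfway cases away from zero instead of to even as fistp would in the default rounding mode.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format x with 3 decimal places via decimal fixed-point.
 * Assumes x * 1000 fits in a long and x is not NaN. */
static int format_fixed3(char *buf, size_t size, double x)
{
    double scaled_fp = x * 1000.0;
    /* Stand-in for fistp: add 0.5 toward the sign, then truncate.
     * (fistp would round halfway cases to even; this rounds them away from zero.) */
    long scaled = (long)(scaled_fp + (scaled_fp < 0 ? -0.5 : 0.5));

    /* Integer division: quotient = integer part, remainder = fractional digits. */
    ldiv_t qr = ldiv(labs(scaled), 1000);

    /* %03ld keeps leading zeros, so 2.062 does not turn into 2.62. */
    return snprintf(buf, size, "%s%ld.%03ld",
                    scaled < 0 ? "-" : "", qr.quot, qr.rem);
}
```

The sign is handled separately so that a value like -0.5 still prints its minus sign even though its integer part is 0.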
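Downside: this overflows a 32-bit integer at much smaller inputs than converting the integer part alone would, because a single 32-bit integer has to hold x * 1000.  That limits |x| to about 2.1 million (2^31 / 1000).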
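To make that limit concrete, here is a small C sketch of a range check you could do before converting. The function name scaled_fits_int32 is hypothetical, and the check is slightly conservative: it rejects a few values just past INT32_MAX that rounding to nearest would still bring into range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: would x * 1000 fit in a signed 32-bit integer?
 * Conservative near the upper bound; NaN compares false and is rejected. */
static bool scaled_fits_int32(double x)
{
    double scaled = x * 1000.0;
    return scaled >= (double)INT32_MIN && scaled <= (double)INT32_MAX;
}
```

With this check, 2147483.0 is accepted (2147483000 fits) but 2147484.0 is rejected (2147484000 is above 2^31 - 1).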
The other way is to use x - (int)x to get the fractional part and multiply only that fractional part by 1000.0.  That leaves (int)x in a separate integer from the fractional part, with x*1000 only existing as floating point, not int32_t.
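A minimal C sketch of this split approach, under some assumptions: the name format_split3 is invented for the example, |x| is assumed to fit in a long long, and the fraction is rounded by adding 0.5 and truncating. One wrinkle shows up that the fixed-point version avoids: rounding the fraction can produce 1000, which has to carry into the integer part.

```c
#include <stdio.h>

/* Hypothetical helper: format x with 3 decimal places by splitting it into
 * integer and fractional parts. Assumes |x| fits in a long long, not NaN. */
static int format_split3(char *buf, size_t size, double x)
{
    const char *sign = x < 0 ? "-" : "";
    if (x < 0)
        x = -x;                               /* work on the magnitude */

    long long ipart = (long long)x;           /* C conversion truncates toward zero */
    double frac = x - (double)ipart;          /* in [0, 1) */

    int millis = (int)(frac * 1000.0 + 0.5);  /* round the fraction to 3 digits */
    if (millis == 1000) {                     /* e.g. 2.9996 rounds up to 3.000 */
        millis = 0;
        ipart++;
    }
    return snprintf(buf, size, "%s%lld.%03d", sign, ipart, millis);
}
```

Because only the fractional part gets multiplied by 1000.0, an input like 3000000.125 works here even though 3000000125 would not fit in an int32_t.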
Fun fact: 
AVX512DQ has an instruction for getting the fractional part directly: VREDUCESD xmm1, xmm2, xmm3/m64, imm8 (and ss/ps/pd versions). It's the part that SSE4 roundsd / vrndscalesd would discard when keeping the integer part.  Even more fun: can keep a specified number of fraction bits. But of course those are binary fraction bits, not decimal places. 
Most x86 CPUs have SSE4.1 these days, but only Skylake-X high-end desktops and modern Xeons have AVX512DQ. :/  And Ice Lake laptops.