Using x87 from a kernel module will "work", but it silently corrupts user-space x87 / MMX state. See "Why am I able to perform floating point operations inside a Linux kernel module?"
You need kernel_fpu_begin() / kernel_fpu_end() to make this safe.
Instead of loading/storing from inline asm, ask for input and produce output on the top of the x87 register stack and let the compiler emit load/store instructions if needed. The compiler already knows how to do that, you only need to use inline asm for the sqrt instruction itself, which you can describe to the compiler this way:
    static inline
    float sqroot(float arg) {
        /* "+t": arg is both input and output in st(0), the top of the x87 stack */
        asm("fsqrt" : "+t"(arg) );
        return arg;
    }
(See the compiler-generated asm for this on the Godbolt compiler explorer)
The operand constraint has to tell the compiler to use the floating-point registers: "+t" means a read-write operand in st(0), the top of the x87 register stack.
Or avoid inline asm entirely by using a GNU C builtin that can inline.
You need -fno-math-errno for the builtin to actually inline as a bare fsqrt or sqrtss, without a fallback call to sqrtf (which exists only to set errno) for inputs that will result in NaN.
    static inline
    float sqroot_builtin(float arg) {
        return __builtin_sqrtf(arg);
    }
For x86-64, we get sqrtss %xmm0, %xmm0 / ret, while for i386 we get fld / fsqrt / ret (see the Godbolt link above). Constant propagation and other optimizations also work through __builtin_sqrtf, which they cannot do through an asm statement.
EDIT: Incorporating @iwillnotexist-idontexist's point (re double loading).
Also, if it were me, I'd add static inline to the declaration and put it in a header file. This will allow the compiler to more intelligently manage registers and avoid stack frame overheads.
(I'd also be tempted to change float to double throughout. Otherwise, you're discarding the additional precision that is used in the actual floating point instructions. Although if you will end up frequently storing the values as float, there will be an additional cvtpd2ps instruction. OTOH, if you're passing arguments to printf, for example, this actually avoids a cvtps2pd.)
But Linux kernel printk doesn't have conversions for double anyway.
If compiled with -mfpmath=387 (the default for 32-bit code), values will stay in 80-bit x87 registers after inlining. But yes, with 64-bit code using the 64-bit default of -mfpmath=sse this would result in rounding off to float when loading back into XMM registers.
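To make the float-versus-double point concrete, here is a small user-space sketch (x86 only, because of the fsqrt asm). The names sqroot_f, sqroot_d, and precision_lost are invented for this illustration, and the volatile store is there only to force rounding to float even where x87 would otherwise keep the extra precision.

```c
/* fsqrt operates on st(0); the C type determines how the result is
 * rounded once it leaves the x87 register. */
static inline float sqroot_f(float arg)
{
    __asm__("fsqrt" : "+t"(arg));
    return arg;
}

static inline double sqroot_d(double arg)
{
    __asm__("fsqrt" : "+t"(arg));
    return arg;
}

/* Hypothetical helper: how much of the double result is lost
 * when the value is stored as a float. */
static double precision_lost(double x)
{
    /* volatile forces an actual store, and therefore rounding to float */
    volatile float stored = sqroot_f((float)x);
    return sqroot_d(x) - (double)stored;
}
```

For an input like 2.0 the difference should come out on the order of 1e-8, while exactly representable results such as sqrt(4.0) lose nothing.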
kernel_fpu_begin() saves the full FPU state, and avoiding SSE registers and only using x87 won't make it or the eventual FPU restore when returning to user-space any cheaper.
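Putting the pieces together, a caller might look roughly like the sketch below. Treat it as an illustration under assumptions rather than a drop-in implementation: isqrt_via_fpu and the user-space stand-ins for kernel_fpu_begin() / kernel_fpu_end() are hypothetical, added only so the snippet compiles outside the kernel, and <asm/fpu/api.h> is assumed to be where your x86 kernel version declares the real functions (older kernels used a different header). It is x86-only because of the fsqrt asm.

```c
#ifdef __KERNEL__
#include <asm/fpu/api.h>  /* assumed home of kernel_fpu_begin()/kernel_fpu_end() */
#else
/* Hypothetical user-space stand-ins: they only track nesting and save nothing. */
static int fpu_section_depth;
static inline void kernel_fpu_begin(void) { fpu_section_depth++; }
static inline void kernel_fpu_end(void)   { fpu_section_depth--; }
#endif

/* Same x87 helper as in the answer: "+t" keeps arg in st(0). */
static inline float sqroot(float arg)
{
    __asm__("fsqrt" : "+t"(arg));
    return arg;
}

/*
 * Hypothetical caller: integers in and out, so no floating-point value
 * is live outside the kernel_fpu_begin()/kernel_fpu_end() region.
 */
static int isqrt_via_fpu(int x)
{
    int result;

    kernel_fpu_begin();
    result = (int)sqroot((float)x);  /* conversion truncates toward zero */
    kernel_fpu_end();

    return result;
}
```

Keeping the protected region this small, and keeping all FP values inside it, is the design choice here: everything the compiler does with x87 or XMM registers happens between the two calls.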