Why is the cmath library so slow when it comes to rounding (round, ceil, floor, trunc)?
We are talking about a factor of 10 compared to SSE (roundsd, cvtsd2si) or the good old FPU (FIST(P)), the latter being a bit slower (20-25%) but getting closer with rising clock frequency.
I've read an article by L. de Soras, and his description is quite clear. The immediate operand of roundsd/roundpd allows selecting any of the available rounding modes. Checking the disassembly of round, I could not find any LDMXCSR instruction, just CVTTSS2SI (scalar conversion to int with truncation).
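For concreteness, here is a minimal sketch of the kind of comparison I mean. It is not my actual benchmark code. The helper names `round_nearest_sse` and `trunc_sse` are made up for illustration, and the sketch assumes the default MXCSR rounding mode (round to nearest, ties to even). On targets without SSE2 it falls back to `std::lrint` and a plain cast.

```cpp
#include <cmath>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: _mm_set_sd, _mm_cvtsd_si32, _mm_cvttsd_si32
#endif

// Hypothetical helper: round to nearest int using the current rounding mode.
// With the default mode, halfway cases go to the nearest even integer.
inline int round_nearest_sse(double x) {
#if defined(__SSE2__)
    return _mm_cvtsd_si32(_mm_set_sd(x));    // compiles to cvtsd2si
#else
    return static_cast<int>(std::lrint(x));  // portable fallback, same rounding mode
#endif
}

// Hypothetical helper: truncate toward zero.
inline int trunc_sse(double x) {
#if defined(__SSE2__)
    return _mm_cvttsd_si32(_mm_set_sd(x));   // compiles to cvttsd2si
#else
    return static_cast<int>(x);              // a cast truncates toward zero
#endif
}
```

Note that the two are not drop-in replacements for each other: `std::round` rounds halfway cases away from zero and returns a double, while the conversion above uses the current rounding mode and is limited to the int range. So `std::round(2.5)` gives 3.0, whereas `round_nearest_sse(2.5)` gives 2 under the default mode.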
So why is there a 1000% longer wait for such frequently needed functionality?