Assuming the maximum possible value is less than 1E154, and thus all values fit in 512 bits, than I suspect the answer might be:
- Precalculate all possible powers_of_10in a static const array. (#ops ~0)
- If all numbers fit in 2 YMM registers, that means the max supported value is 1E154, which means the lookup table of powers of 10 would take ~9856 bytes.
 
- In your number, calculate the number of leading binary zeroes (Using clzas well) (#ops < ~10)
- Use (max-bits - number_of_leading_zeroes) / 3.32192809489to give a good estimate of the final number of decimal digits. This is also a good estimate a close power of 10. (#ops ~2)
- Iterate the powers_of_10from that estimate until you find the largest power-of-10 less than your value. (#ops ~8)
- If you're willing to sacrifice accuracy in the exponent, then this step can be skipped.
 
- Add that power of 10 to itself in a loop until greater than your input. (#ops < ~100) (10 additions of 10 uint64s)
- If you're willing to sacrifice minor accuracy in the mantissa, then double(input)/double(power_of_ten)will do it in a single division.
- There's lots of shortcuts that you can take in the bigint->double conversion.
 
- Emit loop_count-1Epower_of_ten_index. (#ops ~4)
If you're willing to sacrifice accuracy in both exponent and mantissa, that leaves ~16 operations that completely ignore the low bits.
Performance
Without writing out a final implementation, performance is hard to guess at, and with a large LUT, cache, and therefore the rest of the program becomes a factor, but here's preliminary data: https://quick-bench.com/q/53k-xSQz7y4iCO7ny66Dz2w62dQ (I ran it several times to try to eliminate outliers)
In my tests, the fastest combination seems to (unsurprisingly) be:
- Do not iterate the powers_of_tento confirm the exponent.
- Do not exactly calculate the mantissa.
- Use floats to guess the mantissa (preferring speed to accuracy)
From this baseline, we can see:
- Iterating the powers_of_tenseems to have no noticeable impact on time on average, as long as you also use floats to guess the mantissa. If you do not use floats to guess the mantissa, then the mantissa calculation will take significantly longer. This implies it is not adding significant accuracy on average, and can be skipped to minimize code size.
- Using floats to guess the mantissa seems to make the algorithm ~5% faster on average, with no impact on accuracy.
- Iterating to find the exact mantissa slows the algorithm about 16%, but also implies that it's adding accuracy, so I assume you want to keep that.
- I had two variants of my bigint->double conversion code when estimating the mantissa. One version only ensured the double had at least 1 significant bit minimum, and the other version ensured that the double had >4 significant bits. The additional code to ensure more significant bits had no noticeable impact on time on average, so the additional code/accuracy at that step did not result in noticeably more performance in finding the exact mantissa, and I assume you want to skip that, to minimize code size.
It is also interesting to note that all of these are ~4x faster than converting the bigint to a double and then using printf("%.0E",value);
(I do not vouch for the accuracy of the results of any of this code)