Storing non-negative floating point values

Question

Is there an efficient way to store non-negative floating point values using the existing float32 and float64 formats?

Imagine the default float32 behaviour which allows negative/positive:

val = bytes.readFloat32();

Is it possible to allow for greater positive values if negative values are not necessary?

val = bytes.readFloat32() + 0xFFFFFFFF;

Edit: Essentially when I know I'm storing only positive values, the float format could be modified a bit to allow for greater range or precision for the same amount of bits.

Eg. The float32 format is defined as 1 bit for sign, 8 bits for exponent, 23 bits for fraction

What if I don't need the sign bit, can we have 8 bits for exponent, 24 bits for fraction to give greater precision for the same 32 bits?

Are you thinking they are like ints which can be made into unsigned ints and get another bit? I don't think so...or it would have been done already. — Steve Wellens, Apr 14 '12 at 15:49
Tell us more about your problem. Why do you want to do this? You may be asking the wrong question. — TonyK, Apr 14 '12 at 15:57

score 3 · Answer 1 · answered Apr 14 '12 at 15:47

3

Floating-point numbers (float32 and float64) have an explicit sign bit. The equivalent of unsigned integers doesn't exist for floating-point numbers.

So there is no easy way to double the range of positive floating-point numbers.

answered Apr 14 '12 at 15:47

Jeffrey Sax

10,253
3
29
40

There are no `float32` or `float64` types in C++ – phuclv Oct 16 '22 at 14:11

justin · Answer 2 · 2012-04-14T16:12:56.637

No, not for free.

You can extend the range/accuracy in many ways using other numeric representations. The intent won't be clear, and the performance will typically be poor if you want the range and accuracy of float or double using another numeric representation (of equal size).

Just stick with float or double unless performance/storage is very very important, and you can represent your values well (or better!) using another numeric representation.

phuclv · Answer 3 · 2022-03-11T16:57:40.573

There's almost no support for unsigned float in hardware so you won't have such off-the-shelf feature but you can still have quite efficient unsigned float by storing the least significant bit in the sign bit. This way you can utilize the available floating-point hardware support instead of writing a software float solution. To do that you can

manipulate it manually after each operation

This way you need some small correction to the lsb (A.K.A sign bit), for example 1 more long division step, or a 1-bit adder for the addition
or by doing the math in higher precision if available

For example for float you can do operations in double then cast back to float when storing if sizof(float) < sizeof(double)

Here's a simple PoC implementation:

#include <cmath>
#include <cfenv>
#include <bit>
#include <type_traits>

// Does the math in double precision when hardware double is available
#define HAS_NATIVE_DOUBLE

class UFloat
{
public:
    UFloat(double d) : UFloat(0.0f)
    {
        if (d < 0)
            throw std::range_error("Value must be non-negative!");
        uint64_t dbits = std::bit_cast<uint64_t>(d);
        bool lsb = dbits & lsbMask;
        dbits &= ~lsbMask; // turn off the lsb
        d = std::bit_cast<double>(dbits);
        value = lsb ? -(float)d : (float)d;
    }

    UFloat(const UFloat &rhs) : UFloat(rhs.value) {}

    // =========== Operators ===========
    UFloat &operator+=(const UFloat &rhs)
    {
#ifdef HAS_NATIVE_DOUBLE
        // Calculate in higher precision then round back
        setValue((double)value + rhs.value);
#else
        // Calculate the least significant bit manually
        
        bool lhsLsb = std::signbit(value);
        bool rhsLsb = std::signbit(rhs.value);
        // Clear the sign bit to get the higher significant bits
        // then get the sum
        value = std::abs(value);
        value += std::abs(rhs.value);
        if (std::isfinite(value))
        {
            if (lhsLsb ^ rhsLsb) // Only ONE of the 2 least significant bits is 1
            {
                // The sum's lsb is 1, so we'll set its sign bit
                value = -value;
            }
            else if (lhsLsb)
            {
                // BOTH least significant bits are 1s,
                // so we'll add the carry to the next bit
                value = std::nextafter(value, INFINITY);
                // The lsb of the sum is 0, so the sign bit isn't changed
            }
        }
#endif
        return *this;
    }

    UFloat &operator*=(const UFloat &rhs)
    {
#ifdef HAS_NATIVE_DOUBLE
        // Calculate in higher precision then round back
        setValue((double)value * rhs.value);
#else
        // Calculate the least significant bit manually
    
        bool lhsLsb = std::signbit(value);
        bool rhsLsb = std::signbit(rhs.value);

        // Clear the sign bit to get the higher significant bits
        // then get the product
        float lhsMsbs = std::abs(value);
        float rhsMsbs = std::abs(rhs.value);

        // Suppose we have X.xPm with
        //     X: the high significant bits
        //     x: the least significant one
        // and m: the exponent. Same to Y.yPn
        // X.xPm * Y.yPn = (X + 0.x)*2^m * (Y + 0.y)*2^n
        //               = (X + x/2)*2^m * (Y + y/2)*2^n
        //               = (X*Y + X*y/2 + Y*x/2 + x*y/4)*2^(m + n)
        value = lhsMsbs * rhsMsbs; // X*Y
        if (std::isfinite(value))
        {
            uint32_t rhsMsbsBits = std::bit_cast<uint32_t>(rhsMsb);
            value += rhsMsbs*lhsLsb / 2; // X*y/2
            
            uint32_t lhsMsbsBits = std::bit_cast<uint32_t>(lhsMsbs);
            value += lhsMsbs*rhsLsb / 2; // Y*x/2
            
            int lsb = (rhsMsbsBits | lhsMsbsBits) & 1; // the product's lsb
            lsb += lhsLsb & rhsLsb;
            if (lsb & 1)
                value = -value; // set the lsb
            if (lsb > 1)    // carry to the next bit
                value = std::nextafter(value, INFINITY);
        }
#endif

        return *this;
    }
    
    UFloat &operator/=(const UFloat &rhs)
    {
#ifdef HAS_NATIVE_DOUBLE
        // Calculate in higher precision then round back
        setValue((double)value / rhs.value);
#else
        // Calculate the least significant bit manually
        // Do just one more step of long division,
        // since we only have 1 bit left to divide

        throw std::runtime_error("Not Implemented yet!");
#endif

        return *this;
    }

    double getUnsignedValue() const
    {
        if (!std::signbit(value))
        {
            return value;
        }
        else
        {
            double result = std::abs(value);
            uint64_t doubleValue = std::bit_cast<uint64_t>(result);
            doubleValue |= lsbMask; // turn on the least significant bit
            result = std::bit_cast<double>(doubleValue);
            return result;
        }
    }
    
private:
    // The unsigned float value, with the least significant bit (lsb)
    // being stored in the sign bit
    float value;
    
    // the first bit after the normal mantissa bits
    static const uint64_t lsbMask = 1ULL << (DBL_MANT_DIG - FLT_MANT_DIG - 1);

    // =========== Private Constructor ===========
    UFloat(float rhs) : value(rhs)
    {
        std::fesetround(FE_TOWARDZERO); // We'll round the value ourselves
#ifdef HAS_NATIVE_DOUBLE
        static_assert(sizeof(float) < sizeof(double));
#endif
    }

    void setValue(double d)
    {
        // get the bit pattern of the double value
        auto bits = std::bit_cast<std::uint64_t>(d);
        bool lsb = bits & lsbMask;

        // turn off the lsb to avoid rounding when converting to float
        bits &= ~lsbMask;
        d = std::bit_cast<double>(bits);

        value = (float)d;
        if (lsb)
            value = -value;
    }
}

Some more tuning may be necessary to get the correct lsb

Either way you'll need more operations than normal so this may only be good for big arrays where cache footprint is a concern. In that case I suggest to use this as a storage format only, like how FP16 is treated on most current architectures: there are only load/store instructions for it which expands to float or double and convert back. All arithmetic operations are done in float or double only

So the unsigned float should exist in memory only, and will be decoded to the full double on load. This way you work on the native double type and won't need the correction after each operator

Alternatively this can be used with SIMD to operate on multiple unsigned floats at the same time

Storing non-negative floating point values

3 Answers3

Linked