The floating-point encoding scheme


(C) Gustaf Alhäll, released under CC BY 4.0
Original document can be found at hanicef.me
License can be found at creativecommons.org

Encoding


The floating-point encoding scheme is based on scientific notation, and is divided into three parts: a sign bit, an exponent and a fraction (or mantissa), in that order.

The scheme is based on this equation:

(-1)ˢ * 2ᵉ⁻ᶻ * 1.f₁f₂f₃...fᵤ

Here, s represents the sign bit, e represents the exponent, z represents the exponent bias and f represents the bits in the fraction, where u is the size of the fraction. Each of these values is stored in binary, as a separate segment of the encoded number. The sign segment is always 1 bit, but the exponent and fraction vary in size depending on the size of the floating-point value.

A 32-bit floating-point is segmented like this:

Sign Exponent Fraction
   - -------- -----------------------
   0 10000010 11101000000000000000000

This value represents 15.25, and has an exponent of 8 bits and a fraction of 23 bits. You might've noticed that the exponent is actually 130 instead of just 3, which would be the correct exponent in scientific notation for representing 15.25. The reason for this is something called the "exponent bias", which is the turn-over point at which a value goes above or below 1. The exponent bias is defined as the most significant bit set to 0 and all other bits set to 1; in the case of a 32-bit floating-point value, that's the binary value 01111111, which is 127 in decimal.

A 64-bit floating-point value has an exponent of 11 bits and a fraction of 52 bits. These two formats are the most common and are also standardized under the names single-precision and double-precision, respectively.
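
As a rough illustration of how these segments can be picked apart in software, here is a minimal C sketch that unpacks the 15.25 example above; the masks and shift counts follow from the 1/8/23-bit layout, and memcpy is used to reinterpret the float as raw bits:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float value = 15.25f;
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);  /* reinterpret the float as raw bits */

    unsigned sign     = bits >> 31;           /* 1 bit   */
    unsigned exponent = (bits >> 23) & 0xFF;  /* 8 bits  */
    unsigned fraction = bits & 0x7FFFFF;      /* 23 bits */

    /* prints: sign 0, exponent 130 (2^3), fraction 0x740000 */
    printf("sign %u, exponent %u (2^%d), fraction 0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    return 0;
}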

Special values


Some combinations of exponent and fraction in floating-point values have special meanings, and don't follow the rules specified above.

Subnormal values


Subnormal values occur when the exponent is filled with all 0's. This is a special case that exists to make it possible to represent the value 0 in floating point, along with values too small for the normal formula. Instead of the normal formula, subnormal values are based on this one:

(-1)ˢ * 2¹⁻ᶻ * 0.f₁f₂f₃...fᵤ

The differences are that the fraction has a 0 instead of a 1 at the one's place, and that the exponent is fixed at 1 - z instead of depending on e. Without the first rule, a floating-point value filled with 0's would not be 0, but rather a very small value. This becomes apparent when evaluating such a value using the standard formula:

(-1)⁰ * 2⁰⁻¹²⁷ * 1 = 1 * 2⁻¹²⁷ = 2⁻¹²⁷ ≈ 5.87747e-39 ≠ 0

Counter-intuitively, an encoding of all 0's would thus be an incredibly small value, but still not 0. Instead, what happens with subnormal values is this:

(-1)⁰ * 2¹⁻¹²⁷ * 0 = 0

Note that we are multiplying with a 0 instead of a 1 here, which makes it possible to represent 0.
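
Since this rule makes zero a simple all-0's encoding, both zero and subnormal values can be detected by looking at the exponent bits alone. A minimal C sketch of such a check, using the same bit layout as above (the C standard library already provides fpclassify() for this purpose):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Classify a 32-bit float by its raw exponent field. */
static const char *classify(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);

    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t fraction = bits & 0x7FFFFF;

    if (exponent == 0)
        return fraction == 0 ? "zero" : "subnormal";
    return "normal";  /* (exponent 0xFF would be infinity or NaN) */
}

int main(void)
{
    printf("%s\n", classify(0.0f));    /* zero      */
    printf("%s\n", classify(1e-40f));  /* subnormal */
    printf("%s\n", classify(15.25f));  /* normal    */
    return 0;
}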

Negative zero


This isn't really an exception to the ordinary floating-point rules, but rather a consequence of how the encoding works. It occurs when the sign bit is set and all other bits are 0, resulting in the value -0.

Negative zero behaves mostly like zero, except in a few rare cases. In particular, these rules apply:

0 + 0 = 0
0 - 0 = 0
0 + (-0) = 0
0 - (-0) = 0
(-0) + 0 = 0
(-0) - 0 = -0
(-0) + (-0) = -0
(-0) - (-0) = 0

Basically, the result is -0 when, after rewriting subtraction as addition, both operands are -0. While this might seem confusing, consider this:

(-0) - 0 = (-0) + (-0) = -0
(-0) - (-0) = (-0) + 0 = 0

This is why the table seems so inconsistent; it takes addition into consideration when deciding whether a value should be negative or positive, and subtraction is treated as the inverse of addition.
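
These rules can be verified with the signbit() macro from <math.h>, which reports the raw sign bit; the == operator is no help here, since -0 compares equal to 0. A minimal sketch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float pz = 0.0f, nz = -0.0f;

    printf("0 - 0     is negative: %d\n", !!signbit(pz - pz));  /* 0 */
    printf("(-0) - 0  is negative: %d\n", !!signbit(nz - pz));  /* 1 */
    printf("(-0)+(-0) is negative: %d\n", !!signbit(nz + nz));  /* 1 */
    printf("-0 == 0: %d\n", nz == pz);                          /* 1 */
    return 0;
}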

Infinity


Infinity occurs when the exponent is filled with 1's and the fraction is 0, and is used to protect against overflows. Most operations involving infinity result in either infinity or NaN.

Infinity was introduced as a means to prevent overflows from happening. An overflow occurs when a value reaches its limit and wraps around to the opposite end of its range. For floating-point, when a value reaches its limit, instead of wrapping around to its negative equivalent, it results in an infinite value.

Infinity can be positive or negative, unlike infinity in mathematics where it doesn't have a sign. The sign of the value is based on the direction the calculation was heading: if it was going towards the negative minimum, it results in negative infinity, and if it was going towards the positive maximum, it results in positive infinity.

Unfortunately, infinity has a major issue: it's infectious. Most operations involving infinity result in infinity, too. This can potentially break a lot of things, as developers rarely add safeguards against infinity (which is reasonable; infinity shouldn't occur in software for the same reason overflows shouldn't). When it does occur, algorithms silently produce incorrect results, which almost always leads to software bugs.
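
Both behaviors are easy to demonstrate; a minimal C sketch using FLT_MAX from <float.h> and isinf() from <math.h>:

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    float max = FLT_MAX;

    /* pushing past the largest finite value overflows to infinity
       instead of wrapping around like an integer would */
    float inf = max * 2.0f;
    printf("FLT_MAX * 2 = %f (isinf: %d)\n", inf, !!isinf(inf));

    /* infinity is infectious: ordinary arithmetic cannot undo it... */
    printf("inf - FLT_MAX = %f\n", inf - max);

    /* ...and some operations on it yield NaN instead */
    printf("inf - inf = %f\n", inf - inf);
    return 0;
}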

NaN


Not-A-Number, or NaN for short, occurs when all bits in the exponent are set to 1 and at least one bit in the fraction is set to 1, and is the result of an invalid operation.

NaN was originally introduced as an alternative to stopping or interrupting the execution process of a program. Instead of causing the process to fail on an invalid operation, the operation would yield a NaN instead and allow the program to continue. All subsequent operations involving NaN would consequently yield another NaN.

Unlike any other value in floating-point, NaN exhibits some weird and counter-intuitive properties. For example, it is the only value that is not equal to itself; every ordered comparison involving a NaN is false, which means NaN == NaN is false while NaN != NaN is true. Additionally, it's also the only value that cannot meaningfully be negative; the sign bit of a NaN is ignored and is not considered in operations involving it.

NaN also has two forms: quiet NaN (qNaN), where the most significant bit in the fraction is set to 1, and signaling NaN (sNaN), where the most significant bit in the fraction is set to 0. The key difference between the two is that an sNaN raises a hardware exception when any operation is performed on the value, while a qNaN propagates silently. Note that a hardware exception is not the same thing as an exception in high-level programming languages; hardware exceptions are handled by the kernel and are usually ignored!

NaN also has the issue of being infectious, just like infinity: operations involving NaN result in another NaN. This is especially problematic for qNaNs, since they don't raise an exception when operations are performed on them.
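
A minimal C sketch demonstrating these properties; dividing zero by zero is one of the invalid operations that yields a quiet NaN:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float zero = 0.0f;
    float x = zero / zero;  /* invalid operation: produces a quiet NaN */

    printf("x == x:   %d\n", x == x);        /* 0: not even equal to itself   */
    printf("x != x:   %d\n", x != x);        /* 1: the lone true comparison   */
    printf("x < 1:    %d\n", x < 1.0f);      /* 0: ordered compares are false */
    printf("isnan(x): %d\n", !!isnan(x));    /* the reliable way to test      */
    printf("x + 1:    %f\n", x + 1.0f);      /* NaN is infectious             */
    return 0;
}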

Mathematics


Addition and subtraction


Addition and subtraction require the two values to have the same exponent, which can be achieved with bit shifting. The steps required are:

1. Extract the exponent and fraction of both values, and restore the implicit 1 at the one's place of each fraction.
2. Shift the fraction of the value with the smaller exponent to the right by the difference between the exponents.
3. Add or subtract the fractions.
4. Normalize the result, shifting the fraction and adjusting the exponent until there is again a single 1 at the one's place.
5. Encode the sign, exponent and fraction back into a floating-point value.

For example, to calculate 23.375 + 14.75 and 23.375 - 14.75:

23.375 = 0_10000011_01110110000000000000000₂ = 1.0111011₂ * 2⁴
14.75  = 0_10000010_11011000000000000000000₂ = 1.11011₂ * 2³

(1.0111011₂ * 2⁴) + (1.11011₂ * 2³) = (1.0111011₂ * 2⁴) + (0.111011₂ * 2⁴) = (10.0110001₂ * 2⁴)
10.0110001₂ * 2⁴ = 1.00110001₂ * 2⁵ = 0_10000100_00110001000000000000000₂ = 38.125

(1.0111011₂ * 2⁴) - (1.11011₂ * 2³) = (1.0111011₂ * 2⁴) - (0.111011₂ * 2⁴) = (0.1000101₂ * 2⁴)
0.1000101₂ * 2⁴ = 1.000101₂ * 2³ = 0_10000010_00010100000000000000000₂ = 8.625

When implementing this in practice, what's important to keep in mind is that the multiplication by powers of 2 can be replaced with bit shifts, where the exponent is the number of steps to shift. Likewise, the dot in the fraction can be ignored when performing the actual addition or subtraction; the result will be the same regardless of whether the dot is there, as long as the exponents are aligned.
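
A minimal C sketch of the addition case, following the steps above; it assumes both inputs are positive, normal 32-bit values whose exponents differ by less than 32, truncates instead of rounding, and handles none of the special values:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float add(float a, float b)
{
    uint32_t x, y;
    memcpy(&x, &a, sizeof x);
    memcpy(&y, &b, sizeof y);

    int32_t  ex = (x >> 23) & 0xFF;
    int32_t  ey = (y >> 23) & 0xFF;
    /* restore the implicit 1 at the one's place of each fraction */
    uint32_t fx = (x & 0x7FFFFF) | 0x800000;
    uint32_t fy = (y & 0x7FFFFF) | 0x800000;

    /* align the exponents by shifting the smaller value right */
    if (ex < ey) { fx >>= (ey - ex); ex = ey; }
    else         { fy >>= (ex - ey); }

    /* the dot can be ignored during the addition itself */
    uint32_t f = fx + fy;

    /* normalize: the sum may have carried one place past the one's place */
    if (f & 0x1000000) { f >>= 1; ex++; }

    uint32_t bits = ((uint32_t)ex << 23) | (f & 0x7FFFFF);
    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}

int main(void)
{
    printf("%f\n", add(23.375f, 14.75f));  /* prints 38.125000 */
    return 0;
}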

Multiplication


Multiplication can be done by adding the exponents together and multiplying the fractions. The steps required are:

1. Extract the exponent and fraction of both values, and restore the implicit 1 at the one's place of each fraction.
2. Add the exponents together, and subtract the bias once so it isn't counted twice.
3. Multiply the fractions.
4. Normalize the result and encode the sign, exponent and fraction back into a floating-point value.

For example, to calculate 23.375 * 14.75:

(1.0111011₂ * 2⁴) * (1.11011₂ * 2³) = (1.0111011₂ * 1.11011₂) * 2⁴⁺³ = (10.101100011001₂ * 2⁷)
10.101100011001₂ * 2⁷ = 1.0101100011001₂ * 2⁸ = 0_10000111_01011000110010000000000₂ = 344.78125

The easiest way to implement the multiplication in software is by using the fixed-point scheme. This works by simply pretending there's a dot in a value, and performing simple algebra to compensate for that dot. In this case, the fraction can be multiplied this way:

(1.01110110000000000000000₂ * 1.11011000000000000000000₂) >> 23 = 10.10110001100100000000000₂

Note that we are only pretending that the dot is there; it's not actually there and the shift exists to compensate for the multiplication. For 32-bit floating-point, the shift is 23 as that's the size of the fraction.

Finally, the sign bit can be calculated by simply XOR-ing the sign bits of the two values. This eliminates the need to involve the sign when multiplying.
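
A minimal C sketch of the whole procedure, under the same assumptions as the addition sketch (normal 32-bit values, truncation instead of rounding, no special values), with the sign handled by the XOR trick:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float mul(float a, float b)
{
    uint32_t x, y;
    memcpy(&x, &a, sizeof x);
    memcpy(&y, &b, sizeof y);

    /* XOR-ing the sign bits gives the sign of the product */
    uint32_t sign = (x ^ y) & 0x80000000;

    /* add the exponents; subtract the bias once so it isn't counted twice */
    int32_t e = ((x >> 23) & 0xFF) + ((y >> 23) & 0xFF) - 127;

    /* restore the implicit 1 at the one's place of each fraction */
    uint64_t fx = (x & 0x7FFFFF) | 0x800000;
    uint64_t fy = (y & 0x7FFFFF) | 0x800000;

    /* fixed-point multiply: shift right by 23 to compensate for the
       pretended dot in the fractions */
    uint64_t f = (fx * fy) >> 23;

    /* normalize: the product of two fractions in [1, 2) is in [1, 4),
       so at most one shift is needed */
    if (f & 0x1000000) { f >>= 1; e++; }

    uint32_t bits = sign | ((uint32_t)e << 23) | ((uint32_t)f & 0x7FFFFF);
    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}

int main(void)
{
    printf("%f\n", mul(23.375f, 14.75f));  /* prints 344.781250 */
    return 0;
}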

Limitations


While floating-point is considered the de facto standard for encoding non-integer values in software, it's important to know the shortcomings of the scheme.

The biggest flaw is accuracy. While floating point can represent small values incredibly accurately, its accuracy diminishes as the value gets larger. Once the exponent exceeds the number of bits in the fraction, whole numbers can no longer be represented exactly. For example, the integer 2,147,483,649 (10000000 00000000 00000000 00000001) cannot be represented in a 32-bit floating point, as it requires an exponent so large that the fraction can no longer reach the last binary 1 in the integer.

When these inaccuracies occur, floating-point values are forced to be rounded. These roundings, while often insignificant, can cause incorrect results if an operation relies on accurate calculations. An example of where this can become an issue is when calculating distances in space. Because of the huge distances involved, a floating-point value can be off by whole kilometers. A typical example is the average distance between the sun and the earth, which is 149,597,870 km. This number cannot be represented accurately in a 32-bit floating-point value due to its size, and is off by 2 kilometers (149,597,872).
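
Both roundings are easy to reproduce; a minimal C sketch (2,147,483,649 is the integer from the previous paragraph):

#include <stdio.h>

int main(void)
{
    /* both constants need more significant bits than the 23-bit
       fraction provides, so they are rounded to the nearest value
       the encoding can represent */
    float big  = 2147483649.0f;  /* 2^31 + 1                 */
    float dist = 149597870.0f;   /* sun-earth distance in km */

    printf("%.0f\n", big);   /* 2147483648: the trailing 1 is lost */
    printf("%.0f\n", dist);  /* 149597872: off by 2 km             */
    return 0;
}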

Another issue is the use of base 2. In order to make it possible to implement floating-point efficiently in hardware, the encoding scheme was based on binary rather than decimal. This poses a problem when trying to represent decimal values, as some of them cannot be represented in binary: their fractions have an infinite number of digits when converted to base 2. A typical and surprising side effect of this is that, in 32-bit floating point, 0.3 + 0.4 ≠ 0.7, which just looks completely wrong. It's a consequence of the fact that 0.3, 0.4 and 0.7 are 0.01001100110..., 0.01100110011... and 0.10110011001... respectively in binary, so all three are rounded when encoded, and the rounding errors don't cancel out.
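
A minimal C sketch of this effect in 32-bit floating point:

#include <stdio.h>

int main(void)
{
    float a = 0.3f, b = 0.4f;

    /* none of the three constants is exact in binary, and the
       rounding errors don't cancel out */
    printf("%d\n", a + b == 0.7f);  /* 0 */
    printf("%.9f\n", a + b);        /* 0.700000048 */
    printf("%.9f\n", 0.7f);         /* 0.699999988 */
    return 0;
}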