Understanding Floating-Point Values

Floating-Point Values

Floating-point types are often used in programs to hold real numbers. However, there is a lot of misunderstanding about how they are implemented on a digital system, resulting in code that does not work as expected.

If you were to work out by hand the result of the fraction 4/3 in decimal form, you would (hopefully) arrive at the result 1.33333333333 (recurring). If you use a pocket calculator, the result will be the same, displayed to as many digits as that device's display allows. So, it can come as quite a shock to discover that assigning this same fraction to a floating-point variable in your C program might store the value as 1.33333337307, a seemingly poor approximation of the correct value. The critical point to remember when using floating-point values on a digital system is that, although they are intended to represent any value to any number of significant digits, these values are stored using a finite number of bits and hence have limitations. Unlike integers, which are limited only in their range, floating-point numbers are limited in both range and precision.
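
Under these constraints, the value stored for 4/3 is simply the nearest representable value. The following minimal sketch, assuming standard C on a system whose float type is the 32-bit IEEE format, shows the value actually held:

#include <stdio.h>

int main(void)
{
    float f = 4.0f / 3.0f;  // nearest representable 32-bit value to 4/3

    printf("%.11f\n", f);   // prints 1.33333337307, not 1.33333333333
    return 0;
}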

Mantissa and Exponent Components

Digitally stored floating-point numbers consist of two components: the mantissa (or significand) and the exponent, with the value given by the following expression:

value = mantissa x base^exponent

In mathematical calculations, the base is often 10, but when these numbers are stored digitally, the base is usually two. Exactly how these components are encoded is not important for this discussion, but the number of bits allocated to each component affects the range and precision of the values that can be represented. Using more bits to encode the mantissa allows values to be represented more precisely; using more bits in the exponent means that a larger range of values can be represented.
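
The standard math library can decompose a value into these two components for you. The sketch below uses frexp() from <math.h>, which returns the mantissa as a fraction in the range [0.5, 1.0) together with the base-two exponent:

#include <stdio.h>
#include <math.h>

int main(void)
{
    int exponent;
    double mantissa = frexp(8000000.0, &exponent);

    printf("%f x 2^%d\n", mantissa, exponent);  // prints 0.953674 x 2^23
    return 0;
}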

Floating-Point Values in Compilers

Compilers commonly use an IEEE representation for floating-point values, which specifies the number of bits to use for the mantissa and exponent, but there are several formats within this standard. The single-precision format uses a total of 32 bits to represent a floating-point number and consists of eight bits of exponent and 24 bits of mantissa (including one sign bit). This format is supported by all three MPLAB® XC compilers. The double-precision format uses a total of 64 bits to store a floating-point number and consists of 11 bits of exponent and 53 bits of signed mantissa. This format is supported by MPLAB XC16 and XC32. By default, the XC8 compiler uses a 24-bit floating-point format that is a truncated form of the 32-bit format, having eight bits of exponent but only 16 bits of signed mantissa.
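
If you are unsure which format a particular compiler and set of options will give you, one quick check, sketched below for a hosted standard C environment, is to print the storage size of each type. This distinguishes the 24-, 32-, and 64-bit formats, although not how the bits are divided between mantissa and exponent:

#include <stdio.h>

int main(void)
{
    // the sizes reported depend on the compiler and its options
    printf("float:  %u bytes\n", (unsigned)sizeof(float));
    printf("double: %u bytes\n", (unsigned)sizeof(double));
    return 0;
}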

The larger IEEE formats allow precise values covering a large range to be handled. However, these formats require more data memory to store values of this type, and the library routines that process these values are very large and slow. Floating-point calculations are always much slower than integer calculations and should be avoided if at all possible, especially if you are using an 8-bit device. This page indicates one alternative you might consider.

Rounding

If you do use floating-point values in your code, remember that even floating-point constants are rounded. For example, if you are using a 32-bit floating-point format and you assign the value 128.0 to a variable of that type, the value can be represented exactly. However, the next largest number that can be assigned is 128.000015259. If you were to assign a value that fell between these two values, it would be rounded up or down. The larger the magnitude of the value, the more widely spaced the exactly representable values become. So, the value 8000000.0 can be represented exactly by a 32-bit floating-point type, but the next highest value is 8000000.5.
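
If your library supports C99, you can find these neighboring values directly rather than working them out by hand. This sketch uses nextafterf() from <math.h> to print the representable 32-bit value immediately above each of the values just mentioned:

#include <stdio.h>
#include <math.h>

int main(void)
{
    printf("%.9f\n", nextafterf(128.0f, INFINITY));      // 128.000015259
    printf("%.1f\n", nextafterf(8000000.0f, INFINITY));  // 8000000.5
    return 0;
}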

The following table shows the 32-bit examples we have seen above and the bit sequences that are used to represent those values. The sequences have been split into the exponent and mantissa so that you can see how similar values differ by only one bit in the mantissa. Again, how these components are encoded is not important when it comes to using these values in a program.

Value          Bit Sequence  Mantissa  Exponent
128.0          0x43000000    0x000000  0x07
128.000015259  0x43000001    0x000001  0x07
8000000.0      0x4af42400    0x742400  0x16
8000000.5      0x4af42401    0x742401  0x16
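
You can reproduce these bit sequences yourself. The sketch below, which assumes the target stores float as a 32-bit IEEE value, copies each float into a 32-bit integer and prints the pattern in hexadecimal:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

// print the raw bit pattern of a 32-bit float
static void printBits(float f)
{
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);
    printf("%15.6f = 0x%08lx\n", (double)f, (unsigned long)bits);
}

int main(void)
{
    printBits(128.0f);      // 0x43000000
    printBits(8000000.0f);  // 0x4af42400
    printBits(8000000.5f);  // 0x4af42401
    return 0;
}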

Consider the following code, which assigns one value to a floating-point variable and then immediately compares that variable with a different value. Although the two floating-point constants that appear in the source code are different, they compare as equal when this code is executed.

double myFloat = 8000000.0;  // 32-bit doubles

if (myFloat == 8000000.2)    // value will be rounded
    doSomething();           // this line will be executed

If you are using the 24-bit format with MPLAB XC8, you lose precision, and the discrepancy between the intended and actual values can become large. The execution of the function call in the following example might make it seem as though something has gone wrong, but this is just a manifestation of rounding: with only 16 bits of mantissa, the representable values near 8000000.0 are 128 apart, so 8000061.0 is rounded down to 8000000.0.

double myFloat = 8000000.0;  // 24-bit doubles

if (myFloat == 8000061.0)    // value will be rounded
    doSomething();           // this line will be executed

Here is how 8000000.0 and the next highest representable 24-bit value are encoded:

Value      Bit Sequence  Mantissa  Exponent
8000000.0  0x4af424      0x7424    0x16
8000128.0  0x4af425      0x7425    0x16

Beware of operations that involve two floating-point values of different magnitudes. The following code appears to add the value 0.2 to myFloat 100 times.

double myFloat = 8000000.0;  // 32-bit doubles
int cnt = 100;

while (cnt--) {              // this entire loop will have no effect
    myFloat += 0.2;
}

It is tempting to think that the result should be 8000020.0, but that is not the case. Certainly, the value 0.2 will be rounded, but the rounded value will still be quite close to the intended value. The problem here is that the value being added is not large enough to shift the larger operand to the next representable value. The result is that each addition leaves the value of myFloat unchanged, and the entire loop has no effect.
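
One common way to reduce this effect, sketched below under the same assumption of 32-bit doubles, is to accumulate the small increments in a separate variable and add the total to the large value in a single step:

double myFloat = 8000000.0;  // 32-bit doubles
double total = 0.0;
int cnt = 100;

while (cnt--) {
    total += 0.2;            // the small values sum without being swamped
}
myFloat += total;            // myFloat now holds a value close to 8000020.0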

Be aware that complex floating-point algorithms are used to perform seemingly simple operations such as addition and multiplication and that rounding can take place during these calculations. Even printing a floating-point value requires a complicated conversion to a string of decimal digits. Never expect the results of complex calculations to exactly match the theoretical result as determined by pen and paper or a high-precision pocket calculator. If you must check the result of a calculation in a program, you might need to ensure that the result is within a range of values rather than being exactly equal to the expected value.
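
For example, a range check might look like the following sketch, in which closeEnough() is a hypothetical helper and the tolerance is something you must choose for your application:

#include <math.h>

// hypothetical helper: non-zero if result is within tolerance of expected
static int closeEnough(double result, double expected, double tolerance)
{
    return fabs(result - expected) <= tolerance;
}

// usage: accept any result within half a unit of the expected value
// if (closeEnough(myFloat, 8000020.0, 0.5))
//     doSomething();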
