Those who use and write programs that use floating-point to solve problems have a common problem. Floating-point numbers and operations are expected to behave like real numbers and operations. This quite reasonable expectation is, unfortunately, never true. All floating-point systems differ from the real number system in most of their basic assumptions.
In this paper, these differences will be explored. The axiomatic basis for the real number system will be presented with a system of 18 axioms. Examples of how most floating-point systems violate most of these axioms will then be shown. There will be particular emphasis on the behavior of the IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985.
What are the properties of the real number system? In a nutshell, the real number system is an uncountably dense ordered field. That means, among other things, that it is a group over addition, a group over multiplication, it is ordered, it is dense, and there are uncountably many real numbers. We will look at each of these properties in turn.
Properties of Addition. In order to be a group over addition, a number system must obey the following axioms.
Properties of Multiplication. The real numbers (excluding zero) also form a group over multiplication.
In addition to these axioms, addition and multiplication together must obey another axiom.
The axioms listed so far define a field. There are many different fields that meet these axioms that are quite different from real numbers. In particular, the complex numbers are also a field.
So, how are complex numbers different from real numbers? The simplest feature that distinguishes these two types of fields is that the real numbers are ordered and the complex numbers are not.
Properties of Relations. A field is said to be ordered (actually, totally ordered, in the case of real numbers) if one can define the relations < (less-than), > (greater-than), and = (equal-to) so that they have the following properties.
Relations also have the following properties with respect to addition and multiplication.
The Property of Density. The real numbers have another important property. They are dense.
We now have a set of axioms that describes both the rational number system and the real number system. We must now come up with a property that the reals (numbers like 5, 1/3, Sqrt(2), pi) have that the rationals (5 and 1/3) don't. It is an obscure one that is related to the limits of infinite series.
The Property of Uncountability. There are a great many more real numbers than rationals. This property has come to be known as uncountability.
For example, if S is the set
{3, 3.1, 3.14, 3.141, 3.1415,...}
every element in the set is a rational number. Also, 4 is an upper bound for that set since 4 is greater than any number in the set. 3.2 is also an upper bound for the set. But the least upper bound is the number pi, which is not a rational number. In this way, irrational numbers show up as the limits of a series of rational numbers.
A number system that meets the above axioms is sufficiently rich to allow us to solve some simple equations. For example the equation
ax+b = 0
has the solution
x = -b/a
by use of these axioms. To illustrate how much we use these axioms without being aware of them, lets go through the proof.
ax+b = 0 given. (ax+b)+(-b) = 0+(-b) by axioms 1 and 15. ax+(b+(-b)) = -b by axioms 2 and 4. ax+0 = -b by axiom 3. ax = -b by axiom 2. a-1(ax) = a-1(-b) by axioms 6 and 16. (a-1a)x = -ba-1 by axioms 9 and 10. 1x = -b/a by axiom 8 and the definition of divide. x = -b/a by axiom 7.
Similarly, systems of linear equations in many unknowns can also be solved.
But in order to solve quadratic equations we should define the square root operation.
With this definition we can now solve equations of the form
ax2+bx+c = 0
using the quadratic formula.
One might add more definitions to cover cubic or higher order equations, infinite series or products, and differential equations. But this system is adequate to demonstrate the difference between real numbers and floating-point numbers.
Floating-point numbers have become, over the years, the way in which real numbers are represented on computers. This representation varies from machine to machine but has properties common to all.
All floating-point numbers may be broken into three fields: the sign; the exponent; and the mantissa. The latter two fields contain values in positional notation in some base b. Typical values for b are 2, 10, and 16 but other bases have been used. IEEE-754 uses b=2.
These fields have the following meanings.
The number zero often presents a problem for this field. Indeed, many floating-point systems have both +0 and -0. IEEE-754 is one such system.
In IEEE-754 there are 8 bits in the single precision exponent field biased by the constant -127. In double precision, there are 11 bits biased by -1023.
We now have a system in which numbers are represented in a way which is analogous to scientific notation. For example, if b=10, the numbers .000002, -3, 137.03604, 186282.396, and -4000000 can now be represented as 2×10-6, -3×100, 1.3703604×102, 1.86282396×105, and -4×106, respectively.
These representations are not unique. For example, the number 137.03604 can be represented equally well by 1.3703604×102, 137.03604×100, or .0013703604×105. The concept of normalization is thus introduced to eliminate this redundancy.
A number is said to be normalized if the value of the mantissa field, m, is in the range 1<=m<b. Since it is always possible to normalize any number other than zero, most floating-point systems deal only with normalized numbers. (In a b's-complement system we want -b<=m<-1 for negative m.)
If b=2, the requirement that the mantissa be normalized implies that the most significant bit will always be one. If a bit is always one it need not be stored to be known. This is known as a `hidden bit'. Some special provision must be made for the value zero but in all other cases this increases the effective size of the mantissa by one bit.
IEEE-754 has 23 bits for the mantissa field in single precision and 52 bits in double. IEEE-754 uses a `hidden bit' to increase the effective size of this field to 24 and 53 bits, respectively.
Most floating-point systems use a fixed finite amount of space in which to store their numbers. Therefore, such systems can only represent a finite number of values. Since there are infinitely many real numbers, most of them will not be exactly representable as a floating-point number. The error thus introduced in trying to represent these numbers is thus called representation error. This error takes basically two forms.
For example, if b=10 and the mantissa is represented by five base-10 digits, the numbers 2.71828 and 2.71829 cannot be represented exactly. If, as is usually done, these numbers are both represented by the five digit number 2.7183, then they cannot be distinguished. This is often the first form of error introduced into a calculation.
In base-10, if the exponent is represented by two base-10 digits, the numbers 2×10-60 and 3×101000 cannot be represented. Only numbers whose absolute values are between 10-50 and 1050 can be represented.
Of these errors, the first is more common but usually less serious. When a number falls outside the exponent range, however, there is no `nearby' number that may be used instead. In most floating-point systems, if a number is too small to be represented, the number zero is substituted. If a number is too large, the largest representable number of the appropriate sign is often substituted.
Just as representation error happens when numbers cannot be accurately represented, so calculation error happens when operations cannot be accurately performed.
When a calculation is performed on floating-point numbers, the result is often not exactly representable as a floating-point number. This error, which derives from the finite nature of floating-point systems, also take two forms.
For example, the product of the two numbers 2.2002 and 2.0002 is 4.40084004. In a base-10 system with five mantissa digits an error of .00004004 will be committed by coercing the result to 4.4008.
IEEE-754 addresses both of these problems. It reserves the smallest exponent value to represent those values for which the hidden bit is zero. In this way, increasingly small numbers (including zero) can be represented with decreasing accuracy. These numbers, known as denormalized numbers allow for a phenomenon known as gradual underflow.
IEEE-754 also reserves the largest exponent value for special numbers. Two of these numbers are +infinity and -infinity. These numbers are used to represent values too large to otherwise be represented.
Given everything we now know about both floating-point numbers and real numbers we can now show some of the fallacies involved in thinking that they behave alike. We will do this by going over the axioms of the real number system one by one and showing examples of how a sample floating-point system behaves for each axiom.
The sample floating-point system we will use is a base-10 system with five mantissa digits and two exponent digits but the examples will have analogs in any floating-point system. We will also occasionally allow the sample floating-point system to have denormalized numbers for its least exponent and the two numbers +infinity and -infinity. This will allow us to illustrate how IEEE-754 behaves in certain circumstances.
Properties of Addition. How do floating-point systems behave with respect to addition?
In our sample system we find that
5.0002 + 5.0002 = 10.000
not 10.0004 as it should. Also,
5.0002×1049 + 5.0002×1049 = 9.9999×1049 or +infinity
not 1.00004×1050.
Signed magnitude systems (like IEEE-754) have both +0 and -0. In addition, most numbers in all floating-point systems are pseudo-zeros. These are numbers z such that x+z = z+x = x for some x.
For example, in our sample system, the number 1 is a pseudo-zero for any number larger than 106 since
1.0000×106 + 1 = 1.0000×106
not 1.00001×106.
They often have more than one inverse for some numbers. For example, let x = 1.5000×10-49 and x = -1.0000×10-49. Since the sum of these two numbers (5.0000×10-50) is too small to be represented in our sample system, we have that x + y = 0. Therefore, y is an additive inverse for x.
IEEE-754 doesn't suffer from this problem because of the existence of the denormalized numbers. This allows very small numbers to be represented with less than normal accuracy. But, it has the effect of allowing all sufficiently small add and subtract operations to be performed exactly.
If we allow our sample system to have denormalized numbers, the sum of these two numbers is representable as the number 0.5000×10-49. This allows all the additive inverses in the IEEE-754 system to be unique.
A mild example is
(3.1415 + 1000.1) - 1000.0
= 1003.2 - 1000.0
= 3.2000
whereas
3.1415 + (1000.1 - 1000.0)
= 3.1415 + .10000
= 3.2415.
Only three of the five digits in the mantissa were lost in this case. They can all be lost, as in
(.31415 + 10000.) - 10000.
= 10000. - 10000.
= 0.0000
vs.
.31415 + (10000. - 10000.)
= .31415 - 0.0000
= .31415.
This problem, in which a small error committed in one operation is amplified by another operation, is common to all floating-point systems. Sometimes the problem can be so severe, as in
(3.1415+1.0000×1040) - 1.0000×1040 = 0.0000
vs.
3.1415+(1.0000×1040 - 1.0000×1040) = 3.1415,
that even a higher precision (say 10 or 20 digits in the mantissa) will not make the error smaller.
It is easy to arrange for this to be true. The operands to an add or subtract operation must be compared anyway to determine which is the greater. That means that the operation is always effectively done as <greater>±<lesser> so the operand order doesn't matter.
Properties of Multiplication. The situation with respect to multiplication is similar to that for addition.
In our sample system we have
2.2002×2.0002 = 4.4008
not 4.40084004. Also,
2.2002×1030 × 2.0002×1035 = 9.9999×1049 or +infinity
not 4.40084004×1065.
On many systems, however, there are some numbers u near 1 such that ux = xu = x for some x.
In our system, if u = .99999 we have that ux = xu = x for all x in (1×10k,5×10k) for all k.
The number 1.0001 may use any number from .99990 to .99995 as an inverse but the number .99999 has none as
.99999×1.0000 = .99999
and
.99999×1.0001 = 1.0001.
In binary the situation is a little better but all systems, including IEEE-754, suffer from this problem.
This also implies that the definition of divide as x/y = x(y-1) fails to work when implemented in floating point. For example,
18.000/3.0000
= 18.000×.33333
= 5.9999
if we use .33333 as (3.0000)-1 and
18.000/3.0000
= 18.000×.33334
= 6.0001
if we use .33334 as (3.0000) -1 . As a result, divide must be implemented as a separately defined primitive if accurate results are desired.
Most floating-point systems implement divide as a separate primitive. An interesting exception is the Cray machine which implements divide as a reciprocation step followed by a multiply. Reciprocation can be accomplished by a table lookup and an adjustment that amounts to a multiplication or two. Thus, in a very high performance machine that is highly pipelined, a faster divide can be obtained in this fashion at the cost of some accuracy.
Except in cases where the intermediate result overflows or underflows, this error is usually small.
Consider
(1.0002×.99995)×.99995
= 1.0001×.99995
= 1.0000
vs.
1.0002×(.99995×.99995)
= 1.0002×.99990
= 1.0001.
The first has two nearly worst case rounding errors and the second has almost no error.
Again, an exception is the double precision floating-point multiply on the Cray machine. This multiply, in order to save time, doesn't calculate all the intermediate products necessary to deliver a completely rounded result. And since one operand (the multiplier) is treated differently from the other (it is Booth encoded), the result may vary depending on the order of the operands.
In our system we have
.99995×(1.0001 + 1.0001)
= .99995×2.0002
= 2.0001
but
.99995×1.0001 + .99995×1.0001
= 1.0000 + 1.0000
= 2.0000.
Properties of Relations. How the relations work on a given floating-point system often depends on whether a relation is considered an arithmetic operation or not.
For example, some floating-point systems determine if x>y by seeing if x-y>0. Since the subtract operation suffers from roundoff error, overflow and underflow, a relation that uses it will inherit these problems.
On the other hand, most floating-point systems allow relations to be determined by a corresponding set of integer relations on the fields of a floating-point number. Thus, to decide if x>y, first look at the signs. If that doesn't answer the question, look at the exponents. If that doesn't settle it, look at the mantissas. Relations in IEEE-754 may be stated in this way.
Assume relations are implemented as a subtract and there are no denormalized numbers in the system. If x = 1.5000×10-49 and y = 1.0000×10-49 then, since x-y = 0, we have that x=y is true.
Relations in the IEEE-754 system don't suffer this problem but they have another. There are some distinguished numbers in the system called NaNs (for Not-a-Number). These numbers are there primarily for diagnostic purposes. However, since they have no value, the result of comparing them with normal numbers is || (for unordered). That is, a NaN is not greater-than, nor less-than, nor equal-to any number. So, the best one can say about any two numbers in the IEEE-754 system is that either x<y, x>y, x=y, or x||y (read as x is unordered with respect to y).
The exception is a b's-complement system that uses subtraction to determine a relation. Since, in such a system there are some numbers that have no negatives, we may have that x<y but not y>x. (Or that x=y but not y=x.)
IEEE-754 has none of these problems and is reflexive.
In our sample system (without denormalized numbers) with
x = 1.0000×10-49
y = 1.5000×10-49
z = 2.0000×10-49
since y-x = 0, z-y = 0, and z-x = 1.0000×10-49, we have that x=y, y=z, but x<z if subtraction is used for relations.
IEEE-754 doesn't exhibit this problem and is transitive.
In some floating-point systems, such as our sample one, we may have any of the following happen:
x = 1.0000×10-49
y = 1.9999×10-49
z = 8.0006×10-49
gives x+z<y+z but not x<y as well as x=y but not x+z=y+z. Also,
x = 1
y = 2
z = 106
gives x<y but not x+z<y+z as well as x+z=y+z x+z=y+z but not x=y.
The best that can be said about a system with good arithmetic and good relations (such as IEEE-754) is
if x+z<y+z then x<=y;
if x<y then x+z<=y+z; or
if x=y then x+z=y+z
but x+z=y+z does not imply that x=y.
The situation for scaling is analogous to that for translation. In our system,
x = .99999
y = 1.0000
z = 1.0001
implies that x<y but not xz<yz as well as xz=yz but not x=y. Also,
x = 1.0000×10-49
y = 1.9999×10-49
z = 2
implies that x=y but not xz=yz as well as xz<yz but not x<y.
In IEEE-754, we can say that
if x<y then xz<=yz;
if xz<yz then x<y; or
if x=y then xz=yz
but xz=yz does not imply that x=y.
There are no numbers between 1.0000 and 1.0001 in our system.
The property of density is lacked by all floating-point systems.
The Property of Uncountability. This property turns out to be linked with the property of density.
Consider the series of numbers en defined by the equation
en = 1/0! + 1/1! + 1/2! + ... + 1/n!
for all n and the set S defined by
S = {e0, e1, e2, e3, ... }.
This infinite set, among the reals, has as its least upper bound the transcendental number e = 2.718281828459045....
However, in our number system, we have
e0 = 1.0000
e1 = 2.0000
e2 = 2.5000
e3 = 2.6667
e4 = 2.7083
e5 = 2.7167
e6 = 2.7181
e7 = 2.7183
e8 = 2.7183
...
...
...
and en = 2.7183 for all n>=7. Not only is this set finite (having only eight distinct elements), it contains an element that is strictly larger than e.
Although there are many more complex issues that come into play, this is ultimately the reason why it is difficult to accurately calculate transcendental functions such as sin(x), cos(x), ex, log(x), and the like on computers.
The situation with square roots is somewhat better.
In our system, Sqrt(4.0000) = 2.0000 since 2.00002 = 4.0000. But 4.0001, 4.0002, and 4.0003 have no square roots since 2.00012 = 4.0004.
By now we have shown that, of the 18 axioms for the real number system presented, floating-point systems only meet 7 in any meaningful way. Floating-point number systems:
But, floating-point systems:
Let's go back to the proof that x = -b/a is the solution to the equation ax + b = 0.
ax+b = 0 given. (ax+b)+(-b) = 0+(-b) by axioms 1 and 15. ax+(b+(-b)) = -b by axioms 2 and 4. ax+0 = -b by axiom 3. ax = -b by axiom 2. a-1(ax) = a-1(-b) by axioms 6 and 16. (a-1a)x = -ba-1 by axioms 9 and 10. 1x = -b/a by axiom 8 and the definition of divide. x = -b/a by axiom 7.
In this proof, we used 11 of the 18 axioms presented. Of these, floating-point systems only meet 4 in any meaningful way.
The most important omission in this case is the lack of closure with respect to multiplication and division. Thus, the solution x = -b/a, may overflow, underflow, or may only be inexactly represented in the system.
The situation is not bleak, however. The errors that typically crop up in floating-point calculations usually can be controlled. Indeed, the solution x = -b/a to ax + b = 0 is usually represented adequately by the implied floating-point calculation.
With new hardware, the cost of doing floating-point calculations has dropped to the point where the programmer can often consider doing some form of verification calculations to guard against bad results.
IEEE-754 is among the best floating-point arithmetic available. Its ever increasing adoption by many companies is beginning to show a positive effect in the amount of truly portable floating-point code than is being written.
Finally, it should be recognized that when a floating-point program does something unexpected, it is not necessarily an indication of bad programming. Many of the errors that plague floating-point systems are fundamental. They cannot be removed by clever programming.
However, they can be lived with. But, that's another paper.