FLOATING-POINT FALLACIES


Dan Zuras


Hewlett-Packard Co.



ABSTRACT


Counter The puzzling behavior of some programs that use floating-point arithmetic to solve problems can often elicit some interesting questions. "Why is it that 0.2×5.0 results in 0.9999999 not 1.0 ?" "Why is sin(3.14159265) equal to -8.742276e-8 not 0.0 ?" "Why does SPICE say there is 5000V on this open node when the supply voltage is only 5V ?" "How can the total system usage be 100.1% ?" Common to all of these problems in the expectation that floating-point numbers behave like real numbers. They don't. Why they don't is the subject of this paper.

Introduction

Those who use and write programs that use floating-point to solve problems have a common problem. Floating-point numbers and operations are expected to behave like real numbers and operations. This quite reasonable expectation is, unfortunately, never true. All floating-point systems differ from the real number system in most of their basic assumptions.

In this paper, these differences will be explored. The axiomatic basis for the real number system will be presented with a system of 18 axioms. Examples of how most floating-point systems violate most of these axioms will then be shown. There will be particular emphasis on the behavior of the IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985.

Real Numbers

What are the properties of the real number system? In a nutshell, the real number system is an uncountably dense ordered field. That means, among other things, that it is a group over addition, a group over multiplication, it is ordered, it is dense, and there are uncountably many real numbers. We will look at each of these properties in turn.

Properties of Addition. In order to be a group over addition, a number system must obey the following axioms.

  1. Closure. The number system must be closed. That is, for any two numbers, x and y, in the system, their sum, x+y, is also in the system.
  2. Identity. There is a unique number, called zero or 0, that is the identity for addition. That is, for any number, x, we have x+0 = 0+x = x.
  3. Inverses. For every number, x, there is a unique number, denoted by -x, such that x+(-x) = (-x)+x = 0. This leads to defining the subtract operation as x-y = x+(-y).
  4. Associative. Addition is associative. That is, for every three numbers, x, y, and z, we have (x+y)+z = x+(y+z).
  5. Commutative. The previous axioms define a group. The real numbers are also a commutative or Abelian group over addition. That is, for every two numbers, x and y, we have x+y = y+x.

Properties of Multiplication. The real numbers (excluding zero) also form a group over multiplication.

  1. Closure. For all numbers, x and y, their product, xy, is also a number.
  2. Identity. There is a unique number, called one or 1, such that for all x we have 1x = x1 = x.
  3. Inverses. For all x, there is a unique number, denoted by x-1, such that xx-1= x-1x = 1. The divide operation is thus defined as x/y = x(y-1).
  4. Associative. For all x, y, and z, we have (xy)z = x(yz).
  5. Commutative. For all x and y, we have xy = yx.

In addition to these axioms, addition and multiplication together must obey another axiom.

  1. Distributive. For all x, y, and z, we have x(y+z) = xy+xz.

The axioms listed so far define a field. There are many different fields that meet these axioms that are quite different from real numbers. In particular, the complex numbers are also a field.

So, how are complex numbers different from real numbers? The simplest feature that distinguishes these two types of fields is that the real numbers are ordered and the complex numbers are not.

Properties of Relations. A field is said to be ordered (actually, totally ordered, in the case of real numbers) if one can define the relations < (less-than), > (greater-than), and = (equal-to) so that they have the following properties.

  1. Trichotomy. For all x and y, exactly one of x<y, x>y, or x=y, is true.
  2. Reflexive. For all x and y, if x<y, then y>x and if x=y, then y=x.
  3. Transitive. For all x, y, and z, if x<y and y<z then x<z. Also, if x=y and y=z then x=z.

Relations also have the following properties with respect to addition and multiplication.

  1. Translation Invariance. For all x, y, and z, x+z<y+z if and only if x<y. Also, x+z=y+z if and only if x=y.
  2. Scaling or Gauge Invariance. For all positive x, y, and z, xz<yz if and only if x<y. Also, xz=yz if and only if x=y.

The Property of Density. The real numbers have another important property. They are dense.

  1. Density. For all x and y, if x<y then there is a z such that x<z<y. That is, we can always find another number between any two given numbers. An obvious example is z = (x+y)/2.

We now have a set of axioms that describes both the rational number system and the real number system. We must now come up with a property that the reals (numbers like 5, 1/3, Sqrt(2), pi) have that the rationals (5 and 1/3) don't. It is an obscure one that is related to the limits of infinite series.

The Property of Uncountability. There are a great many more real numbers than rationals. This property has come to be known as uncountability.

  1. Limits Axiom. Any non-empty set of numbers that has an upper bound has a least upper bound. That is, for some set S, if there is a y such that for all x in S we have x<y, then there is a unique z such that x<z for all x in S and z<y for any upper bound y.

For example, if S is the set

{3, 3.1, 3.14, 3.141, 3.1415,...}

every element in the set is a rational number. Also, 4 is an upper bound for that set since 4 is greater than any number in the set. 3.2 is also an upper bound for the set. But the least upper bound is the number pi, which is not a rational number. In this way, irrational numbers show up as the limits of a series of rational numbers.

Algebra

A number system that meets the above axioms is sufficiently rich to allow us to solve some simple equations. For example the equation

ax+b = 0

has the solution

x = -b/a

by use of these axioms. To illustrate how much we use these axioms without being aware of them, lets go through the proof.

ax+b = 0 given.
(ax+b)+(-b) = 0+(-b) by axioms 1 and 15.
ax+(b+(-b)) = -b by axioms 2 and 4.
ax+0 = -b by axiom 3.
ax = -b by axiom 2.
a-1(ax) = a-1(-b) by axioms 6 and 16.
(a-1a)x = -ba-1 by axioms 9 and 10.
1x = -b/a by axiom 8 and the definition of divide.
x = -b/a by axiom 7.

Similarly, systems of linear equations in many unknowns can also be solved.

But in order to solve quadratic equations we should define the square root operation.

  1. Square Root. For all non-negative x, there is a number called Sqrt(x) which is the solution of the equation y2 = x. Thus, we have that (Sqrt(x))2 = x.

With this definition we can now solve equations of the form

ax2+bx+c = 0

using the quadratic formula.

One might add more definitions to cover cubic or higher order equations, infinite series or products, and differential equations. But this system is adequate to demonstrate the difference between real numbers and floating-point numbers.

Floating-Point Numbers

Floating-point numbers have become, over the years, the way in which real numbers are represented on computers. This representation varies from machine to machine but has properties common to all.

All floating-point numbers may be broken into three fields: the sign; the exponent; and the mantissa. The latter two fields contain values in positional notation in some base b. Typical values for b are 2, 10, and 16 but other bases have been used. IEEE-754 uses b=2.

These fields have the following meanings.

  1. Sign. The sign field consists of one bit that distinguishes positive numbers from negative numbers. The sign field may be by itself or combined with the mantissa field in some systems.

The number zero often presents a problem for this field. Indeed, many floating-point systems have both +0 and -0. IEEE-754 is one such system.

  1. Exponent. The exponent field contains a value to which the base b is raised to give the scale of the mantissa field. The exponent field is usually given in a fixed number of base-b digits. It may be in b's-complement notation or it may be biased by some constant.

In IEEE-754 there are 8 bits in the single precision exponent field biased by the constant -127. In double precision, there are 11 bits biased by -1023.

  1. Mantissa. (This field is variously named mantissa, fraction or significand with slightly different definitions for each.) The mantissa field contains the magnitude of the value being represented. It is also usually given in a fixed number of base-b digits. It can be combined with the sign field if the two are represented in b's-complement notation.

We now have a system in which numbers are represented in a way which is analogous to scientific notation. For example, if b=10, the numbers .000002, -3, 137.03604, 186282.396, and -4000000 can now be represented as 2×10-6, -3×100, 1.3703604×102, 1.86282396×105, and -4×106, respectively.

These representations are not unique. For example, the number 137.03604 can be represented equally well by 1.3703604×102, 137.03604×100, or .0013703604×105. The concept of normalization is thus introduced to eliminate this redundancy.

A number is said to be normalized if the value of the mantissa field, m, is in the range 1<=m<b. Since it is always possible to normalize any number other than zero, most floating-point systems deal only with normalized numbers. (In a b's-complement system we want -b<=m<-1 for negative m.)

If b=2, the requirement that the mantissa be normalized implies that the most significant bit will always be one. If a bit is always one it need not be stored to be known. This is known as a `hidden bit'. Some special provision must be made for the value zero but in all other cases this increases the effective size of the mantissa by one bit.

IEEE-754 has 23 bits for the mantissa field in single precision and 52 bits in double. IEEE-754 uses a `hidden bit' to increase the effective size of this field to 24 and 53 bits, respectively.

Representation Error

Most floating-point systems use a fixed finite amount of space in which to store their numbers. Therefore, such systems can only represent a finite number of values. Since there are infinitely many real numbers, most of them will not be exactly representable as a floating-point number. The error thus introduced in trying to represent these numbers is thus called representation error. This error takes basically two forms.

  1. Discreteness Error. A mantissa represented as a fixed finite number of digits cannot represent two numbers arbitrarily close together.

For example, if b=10 and the mantissa is represented by five base-10 digits, the numbers 2.71828 and 2.71829 cannot be represented exactly. If, as is usually done, these numbers are both represented by the five digit number 2.7183, then they cannot be distinguished. This is often the first form of error introduced into a calculation.

  1. Exponent Range Error. An exponent represented as a fixed finite number of digits cannot represent arbitrarily large or small numbers.

In base-10, if the exponent is represented by two base-10 digits, the numbers 2×10-60 and 3×101000 cannot be represented. Only numbers whose absolute values are between 10-50 and 1050 can be represented.

Of these errors, the first is more common but usually less serious. When a number falls outside the exponent range, however, there is no `nearby' number that may be used instead. In most floating-point systems, if a number is too small to be represented, the number zero is substituted. If a number is too large, the largest representable number of the appropriate sign is often substituted.

Calculation Error

Just as representation error happens when numbers cannot be accurately represented, so calculation error happens when operations cannot be accurately performed.

When a calculation is performed on floating-point numbers, the result is often not exactly representable as a floating-point number. This error, which derives from the finite nature of floating-point systems, also take two forms.

  1. Roundoff Error. If the result of a floating-point operation lies between two representable numbers, an error will be committed in coercing the result to the nearest representable number.

For example, the product of the two numbers 2.2002 and 2.0002 is 4.40084004. In a base-10 system with five mantissa digits an error of .00004004 will be committed by coercing the result to 4.4008.

  1. Overflow and Underflow Error. If the result of a floating-point operation is too large or too small to be represented, an overflow or underflow, respectively, is said to have occurred.

IEEE-754 addresses both of these problems. It reserves the smallest exponent value to represent those values for which the hidden bit is zero. In this way, increasingly small numbers (including zero) can be represented with decreasing accuracy. These numbers, known as denormalized numbers allow for a phenomenon known as gradual underflow.

IEEE-754 also reserves the largest exponent value for special numbers. Two of these numbers are +infinity and -infinity. These numbers are used to represent values too large to otherwise be represented.

Floating-Point Fallacies

Given everything we now know about both floating-point numbers and real numbers we can now show some of the fallacies involved in thinking that they behave alike. We will do this by going over the axioms of the real number system one by one and showing examples of how a sample floating-point system behaves for each axiom.

The sample floating-point system we will use is a base-10 system with five mantissa digits and two exponent digits but the examples will have analogs in any floating-point system. We will also occasionally allow the sample floating-point system to have denormalized numbers for its least exponent and the two numbers +infinity and -infinity. This will allow us to illustrate how IEEE-754 behaves in certain circumstances.

Properties of Addition. How do floating-point systems behave with respect to addition?

  1. Closure. Floating-point systems are not closed with respect to addition.

In our sample system we find that

5.0002 + 5.0002 = 10.000

not 10.0004 as it should. Also,

5.0002×1049 + 5.0002×1049 = 9.9999×1049 or +infinity

not 1.00004×1050.

  1. Identity. All floating-point systems have a zero. They often have more than one.

Signed magnitude systems (like IEEE-754) have both +0 and -0. In addition, most numbers in all floating-point systems are pseudo-zeros. These are numbers z such that x+z = z+x = x for some x.

For example, in our sample system, the number 1 is a pseudo-zero for any number larger than 106 since

1.0000×106 + 1 = 1.0000×106

not 1.00001×106.

  1. Inverses. In a b's-complement system, the smallest positive number and the largest negative number don't have corresponding additive inverses. However, with this exception, all floating-point systems have additive inverses for all of their numbers.

They often have more than one inverse for some numbers. For example, let x = 1.5000×10-49 and x = -1.0000×10-49. Since the sum of these two numbers (5.0000×10-50) is too small to be represented in our sample system, we have that x + y = 0. Therefore, y is an additive inverse for x.

IEEE-754 doesn't suffer from this problem because of the existence of the denormalized numbers. This allows very small numbers to be represented with less than normal accuracy. But, it has the effect of allowing all sufficiently small add and subtract operations to be performed exactly.

If we allow our sample system to have denormalized numbers, the sum of these two numbers is representable as the number 0.5000×10-49. This allows all the additive inverses in the IEEE-754 system to be unique.

  1. Associative. No floating-point system is associative.

A mild example is

(3.1415 + 1000.1) - 1000.0
= 1003.2 - 1000.0
= 3.2000

whereas

3.1415 + (1000.1 - 1000.0)
= 3.1415 + .10000
= 3.2415
.

Only three of the five digits in the mantissa were lost in this case. They can all be lost, as in

(.31415 + 10000.) - 10000.
= 10000. - 10000.
= 0.0000

vs.

.31415 + (10000. - 10000.)
= .31415 - 0.0000
= .31415
.

This problem, in which a small error committed in one operation is amplified by another operation, is common to all floating-point systems. Sometimes the problem can be so severe, as in

(3.1415+1.0000×1040) - 1.0000×1040 = 0.0000

vs.

3.1415+(1.0000×1040 - 1.0000×1040) = 3.1415,

that even a higher precision (say 10 or 20 digits in the mantissa) will not make the error smaller.

  1. Commutative. Almost all floating-point systems are commutative with respect to addition.

It is easy to arrange for this to be true. The operands to an add or subtract operation must be compared anyway to determine which is the greater. That means that the operation is always effectively done as <greater>±<lesser> so the operand order doesn't matter.

Properties of Multiplication. The situation with respect to multiplication is similar to that for addition.

  1. Closure. No floating-point system is closed with respect to multiplication.

In our sample system we have

2.2002×2.0002 = 4.4008

not 4.40084004. Also,

2.2002×1030 × 2.0002×1035 = 9.9999×1049 or +infinity

not 4.40084004×1065.

  1. Identity. All floating-point systems can represent the number 1 exactly. In all but a very few systems 1x = x1 = x for all x.

On many systems, however, there are some numbers u near 1 such that ux = xu = x for some x.

In our system, if u = .99999 we have that ux = xu = x for all x in (1×10k,5×10k) for all k.

  1. Inverses. Floating-point systems will have some numbers that have unique multiplicative inverses. Some will have more than one. Some may have none.

The number 1.0001 may use any number from .99990 to .99995 as an inverse but the number .99999 has none as

.99999×1.0000 = .99999

and

.99999×1.0001 = 1.0001.

In binary the situation is a little better but all systems, including IEEE-754, suffer from this problem.

This also implies that the definition of divide as x/y = x(y-1) fails to work when implemented in floating point. For example,

18.000/3.0000
= 18.000×.33333
= 5.9999

if we use .33333 as (3.0000)-1 and

18.000/3.0000
= 18.000×.33334
= 6.0001

if we use .33334 as (3.0000) -1 . As a result, divide must be implemented as a separately defined primitive if accurate results are desired.

Most floating-point systems implement divide as a separate primitive. An interesting exception is the Cray machine which implements divide as a reciprocation step followed by a multiply. Reciprocation can be accomplished by a table lookup and an adjustment that amounts to a multiplication or two. Thus, in a very high performance machine that is highly pipelined, a faster divide can be obtained in this fashion at the cost of some accuracy.

  1. Associative. No floating-point system is associative with respect to multiplication.

Except in cases where the intermediate result overflows or underflows, this error is usually small.

Consider

(1.0002×.99995)×.99995
= 1.0001×.99995
= 1.0000

vs.

1.0002×(.99995×.99995)
= 1.0002×.99990
= 1.0001
.

The first has two nearly worst case rounding errors and the second has almost no error.

  1. Commutative. Almost all floating-point systems are commutative with respect to multiplication.

Again, an exception is the double precision floating-point multiply on the Cray machine. This multiply, in order to save time, doesn't calculate all the intermediate products necessary to deliver a completely rounded result. And since one operand (the multiplier) is treated differently from the other (it is Booth encoded), the result may vary depending on the order of the operands.

  1. Distributive. No floating-point system is distributive.

In our system we have

.99995×(1.0001 + 1.0001)
= .99995×2.0002
= 2.0001

but

.99995×1.0001 + .99995×1.0001
= 1.0000 + 1.0000
= 2.0000
.

Properties of Relations. How the relations work on a given floating-point system often depends on whether a relation is considered an arithmetic operation or not.

For example, some floating-point systems determine if x>y by seeing if x-y>0. Since the subtract operation suffers from roundoff error, overflow and underflow, a relation that uses it will inherit these problems.

On the other hand, most floating-point systems allow relations to be determined by a corresponding set of integer relations on the fields of a floating-point number. Thus, to decide if x>y, first look at the signs. If that doesn't answer the question, look at the exponents. If that doesn't settle it, look at the mantissas. Relations in IEEE-754 may be stated in this way.

  1. Trichotomy. In many floating-point systems, one can't always decide if x<y, x>y, or x=y with complete certainty.

Assume relations are implemented as a subtract and there are no denormalized numbers in the system. If x = 1.5000×10-49 and y = 1.0000×10-49 then, since x-y = 0, we have that x=y is true.

Relations in the IEEE-754 system don't suffer this problem but they have another. There are some distinguished numbers in the system called NaNs (for Not-a-Number). These numbers are there primarily for diagnostic purposes. However, since they have no value, the result of comparing them with normal numbers is || (for unordered). That is, a NaN is not greater-than, nor less-than, nor equal-to any number. So, the best one can say about any two numbers in the IEEE-754 system is that either x<y, x>y, x=y, or x||y (read as x is unordered with respect to y).

  1. Reflexive. Even in a floating-point system where the relations cannot be determined with complete certainty, the relations are usually reflexive.

The exception is a b's-complement system that uses subtraction to determine a relation. Since, in such a system there are some numbers that have no negatives, we may have that x<y but not y>x. (Or that x=y but not y=x.)

IEEE-754 has none of these problems and is reflexive.

  1. Transitive. Not all floating-point systems obey transitivity in their relations. We may have, for example, that x=y and y=z but x<z.

In our sample system (without denormalized numbers) with

x = 1.0000×10-49
y = 1.5000×10-49
z = 2.0000×10-49

since y-x = 0, z-y = 0, and z-x = 1.0000×10-49, we have that x=y, y=z, but x<z if subtraction is used for relations.

IEEE-754 doesn't exhibit this problem and is transitive.

  1. Translation Invariance. No floating-point system is guaranteed to maintain a relation across a translation.

In some floating-point systems, such as our sample one, we may have any of the following happen:

x = 1.0000×10-49
y = 1.9999×10-49
z = 8.0006×10-49

gives x+z<y+z but not x<y as well as x=y but not x+z=y+z. Also,

x = 1
y = 2
z = 106

gives x<y but not x+z<y+z as well as x+z=y+z x+z=y+z but not x=y.

The best that can be said about a system with good arithmetic and good relations (such as IEEE-754) is

if x+z<y+z then x<=y;
if x<y then x+z<=y+z; or
if x=y then x+z=y+z

but x+z=y+z does not imply that x=y.

  1. Scaling or Gauge Invariance. No floating-point system can always maintain a relation across a rescaling.

The situation for scaling is analogous to that for translation. In our system,

x = .99999
y = 1.0000
z = 1.0001

implies that x<y but not xz<yz as well as xz=yz but not x=y. Also,

x = 1.0000×10-49
y = 1.9999×10-49
z = 2

implies that x=y but not xz=yz as well as xz<yz but not x<y.

In IEEE-754, we can say that

if x<y then xz<=yz;
if xz<yz then x<y; or
if x=y then xz=yz

but xz=yz does not imply that x=y.

  1. Density. No floating-point system is dense. It is always possible to find two distinct numbers with no number in between.

There are no numbers between 1.0000 and 1.0001 in our system.

The property of density is lacked by all floating-point systems.

The Property of Uncountability. This property turns out to be linked with the property of density.

  1. Limits. It is true that any non-empty set of floating-point numbers that has an upper bound also has a least upper bound. But since no floating-point system is dense, we can never represent infinite sets with limit points between two adjacent floating-point numbers. Thus, calculations that generate irrational numbers cannot be accurately performed in a floating-point system.

Consider the series of numbers en defined by the equation

en = 1/0! + 1/1! + 1/2! + ... + 1/n!

for all n and the set S defined by

S = {e0, e1, e2, e3, ... }.

This infinite set, among the reals, has as its least upper bound the transcendental number e = 2.718281828459045....

However, in our number system, we have

e0 = 1.0000
e1 = 2.0000
e2 = 2.5000
e3 = 2.6667
e4 = 2.7083
e5 = 2.7167
e6 = 2.7181
e7 = 2.7183
e8 = 2.7183
...
...
...

and en = 2.7183 for all n>=7. Not only is this set finite (having only eight distinct elements), it contains an element that is strictly larger than e.

Although there are many more complex issues that come into play, this is ultimately the reason why it is difficult to accurately calculate transcendental functions such as sin(x), cos(x), ex, log(x), and the like on computers.

The situation with square roots is somewhat better.

  1. Square Root. As with multiplicative inverses, some numbers may have unique square roots or some may have none. Unlike multiplicative inverses, any reasonable floating-point system will not have some numbers with multiple square roots.

In our system, Sqrt(4.0000) = 2.0000 since 2.00002 = 4.0000. But 4.0001, 4.0002, and 4.0003 have no square roots since 2.00012 = 4.0004.

Can We Do Algebra?

By now we have shown that, of the 18 axioms for the real number system presented, floating-point systems only meet 7 in any meaningful way. Floating-point number systems:

But, floating-point systems:

Let's go back to the proof that x = -b/a is the solution to the equation ax + b = 0.

ax+b = 0 given.
(ax+b)+(-b) = 0+(-b) by axioms 1 and 15.
ax+(b+(-b)) = -b by axioms 2 and 4.
ax+0 = -b by axiom 3.
ax = -b by axiom 2.
a-1(ax) = a-1(-b) by axioms 6 and 16.
(a-1a)x = -ba-1 by axioms 9 and 10.
1x = -b/a by axiom 8 and the definition of divide.
x = -b/a by axiom 7.

In this proof, we used 11 of the 18 axioms presented. Of these, floating-point systems only meet 4 in any meaningful way.

The most important omission in this case is the lack of closure with respect to multiplication and division. Thus, the solution x = -b/a, may overflow, underflow, or may only be inexactly represented in the system.

The situation is not bleak, however. The errors that typically crop up in floating-point calculations usually can be controlled. Indeed, the solution x = -b/a to ax + b = 0 is usually represented adequately by the implied floating-point calculation.

With new hardware, the cost of doing floating-point calculations has dropped to the point where the programmer can often consider doing some form of verification calculations to guard against bad results.

IEEE-754 is among the best floating-point arithmetic available. Its ever increasing adoption by many companies is beginning to show a positive effect in the amount of truly portable floating-point code than is being written.

Finally, it should be recognized that when a floating-point program does something unexpected, it is not necessarily an indication of bad programming. Many of the errors that plague floating-point systems are fundamental. They cannot be removed by clever programming.

However, they can be lived with. But, that's another paper.