
Estimating floating-point error the easy way

Floating-point numbers are not real numbers. You can't squeeze the whole infinity into a finite number of bits. You have to accept that in computation, real numbers are always represented with some error.

The common misconception is "in practice, we have enough bits to make this error negligible." The problem with this point of view is that quite often it's true, so it perpetuates and empowers itself until something goes horribly wrong.

Whether there are enough bits or not depends on these three things:

  1. The stability of the operation.
  2. The ranges of variables involved.
  3. How much error we can tolerate.

So yes, usually it is enough. But it just can't be enough all the time, for all the computations.

Unforeseen floating-point error is the source of the most unpleasant bugs. The bugs that come and go unpredictably. The bugs that don't reproduce in unit tests and lie low through the integration phase only to be seen by your most important customer.

How do you protect yourself from this threat? Engineering has one ultimate answer for this kind of question, and it is:

Measurement!

Measuring the floating-point error using an inverse operation

Consider a cubic equation solver. It is surprisingly computationally rich for such a primitive operation. It has multiple multiplications, trigonometry, and even square roots.

Fortunately, the inverse operation for solving a cubic equation, getting the coefficients from the given roots, is simple.

ax³ + bx² + cx + d = 0
(x - x₁)(x - x₂)(x - x₃) = 0
x³ - (x₁ + x₂ + x₃)x² + (x₁x₂ + x₂x₃ + x₃x₁)x - x₁x₂x₃ = 0

a = 1
b = -(x₁ + x₂ + x₃)
c = x₁x₂ + x₂x₃ + x₃x₁
d = -x₁x₂x₃

This means that we can measure the error of solving a cubic equation for any three roots we want. Yes, the measurement will also include the error of computing the coefficients themselves, but that error should be small enough to be considered negligible (yes, I see the irony).
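In code, such a measurement might look like the sketch below. The solve_cubic signature here is an assumption, a stand-in for whatever solver you actually want to test; only the round-trip idea matters.

#include <algorithm>
#include <cmath>

// Hypothetical signature: writes up to 3 real roots into `roots`
// and returns how many it found. Stands in for the solver under test.
int solve_cubic(double a, double b, double c, double d, double* roots);

// Worst relative error of the solver on a known triplet of roots.
double measure_error(double x1, double x2, double x3) {
    // The inverse operation: coefficients from the roots.
    const double a = 1.;
    const double b = -(x1 + x2 + x3);
    const double c = x1 * x2 + x2 * x3 + x3 * x1;
    const double d = -x1 * x2 * x3;

    double found[3];
    if (solve_cubic(a, b, c, d, found) != 3)
        return INFINITY; // the solver failed to find all the roots

    // The solver may return the roots in any order, so sort both sides.
    double expected[3] = {x1, x2, x3};
    std::sort(expected, expected + 3);
    std::sort(found, found + 3);

    double worst = 0.;
    for (int i = 0; i < 3; ++i)
        worst = std::max(worst,
            std::abs(found[i] - expected[i]) / std::abs(expected[i]));
    return worst;
}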

Let's do an experiment. Let's measure the relative error for a triplet of roots defined in some exponential range.

x₁ = 1×10^i₁, x₂ = 2×10^i₂, x₃ = 3×10^i₃,
i₁ = -10, -8, ..., 10,
i₂ = -10, -8, ..., 10,
i₃ = -10, -8, ..., 10.
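Reusing the hypothetical measure_error from the sketch above, the sweep itself is just three nested loops over the exponents:

#include <cmath>
#include <cstdio>

double measure_error(double x1, double x2, double x3); // from the sketch above

int main() {
    // Sweep the grid of magnitudes and print the error for every cell.
    for (int i1 = -10; i1 <= 10; i1 += 2)
        for (int i2 = -10; i2 <= 10; i2 += 2)
            for (int i3 = -10; i3 <= 10; i3 += 2)
                std::printf("%d %d %d %g\n", i1, i2, i3,
                    measure_error(1. * std::pow(10., i1),
                                  2. * std::pow(10., i2),
                                  3. * std::pow(10., i3)));
}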

We get a 3D grid of errors for different configurations of the roots' orders of magnitude.

[Interactive figure: a 3D cube of error cells, one per (i₁, i₂, i₃) configuration]

The black cells are where the algorithm failed to find all the roots. The red ones are the ones with the largest error, and the green ones are the ones with the least error. All in all, the error ranges from about 1e-16 to 1e+12. Remember, this is relative error, so 1e+12 for a root on the order of 1e+10 is actually an enormous absolute error that looks like this: 10'000'000'000'000'000'000'000. That's our worst error.

There is a pattern here. The green area shoots through the diagonal. This means that it's all good when the orders of the roots are roughly the same, whether they are all about 1e-8, 1e+2, or 1e+10. It's all fine as long as there is no significant difference between them. But when there is, we have trouble.

I once had a really nasty bug that eventually boiled down to an equation with one root being exactly 600 and the other two at about 2e-5. I had to track it down, and, of course, it really helped to have the error measured. But it also became an eye-opening experience. These are not astronomical numbers. It's like 2 cents and 600'000 euros in one equation. Not even millions.

So here is a piece of trivial advice: when in doubt — measure. When not in doubt — measure and be surprised.

Estimating the floating-point error using intervals

But what if we don't have an inverse operation at all? Or what if it introduces a greater error than the operation itself? It's problematic to calculate computational error when we don't have a reliable way to obtain the ground truth. Can we know the error without actual measurements?

Well, no.

But we can guess.

Let's consider a real number representation to be not a single floating-point number, but an interval between two adjacent floating-point numbers. An operation on intervals widens the resulting interval depending on the values involved.

This is a simplification since we don't account for proper rounding. But since we're only estimating the error and not doing the actual computation, it's fine.
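To make this concrete, here is a minimal sketch of what such an interval type might look like. This is not the real double_interval.hpp, just the widening idea: every operation takes the worst-case bounds and then pushes them one representable number outward, which over-approximates the rounding error.

#include <algorithm>
#include <cmath>

// A real number is represented not as one double but as a pair of
// bounds that are guaranteed to contain it.
struct DoubleInterval {
    double from, to;
    DoubleInterval(double x = 0.) : from(x), to(x) {}
    DoubleInterval(double f, double t) : from(f), to(t) {}
};

DoubleInterval operator+(DoubleInterval a, DoubleInterval b) {
    // Widen by one representable number in each direction to cover
    // whatever rounding the addition itself may introduce.
    return DoubleInterval(
        std::nextafter(a.from + b.from, -INFINITY),
        std::nextafter(a.to + b.to, +INFINITY));
}

DoubleInterval operator*(DoubleInterval a, DoubleInterval b) {
    // With signs unknown, the true bounds may come from any corner.
    double c[4] = {a.from * b.from, a.from * b.to,
                   a.to * b.from, a.to * b.to};
    return DoubleInterval(
        std::nextafter(*std::min_element(c, c + 4), -INFINITY),
        std::nextafter(*std::max_element(c, c + 4), +INFINITY));
}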

This way we can measure how the difference between the real numbers and their floating-point representation accumulates through the calculation. We measure not the actual error, but the width of the interval that definitely holds our value.

Unless our algorithm is stable, operating on intervals makes them larger and larger, and the size of each interval then represents the potential error we might face. Of course, our real error should be smaller than that, so measuring interval sizes doesn't substitute for exact measurements. Still, for our cubic solver, the estimation actually follows the measurements rather well.

You don't have to rewrite all your code to work on intervals instead of floats only to get these estimates. For instance, in C++ you can make an interval-representing type disguise itself as double with a simple #define. Yes, it's a dirty hack, and you have to be careful to have all your maths functions reimplemented for intervals, but it has its perks.

First of all, it's simple. You do something like this:

#include "double_interval.hpp"

#define double DoubleInterval
#include "cubic_solver.hpp"

...
	

And all your doubles acquire from and to fields automatically. Then you just run your tests and check the intervals. Simple!

Second, it enables you to do what almost no one does but almost everyone should. It enables you to debug computational errors. You can see them, you can trace the execution to find their sources, you can find a computational issue before it finds you.
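For instance, a test can assert on the interval width directly. This check_root helper and its tolerance are made up for illustration; the tolerance is whatever your application can live with:

#include <cassert>

// Hypothetical test helper: after the #define trick, the solver's
// "doubles" are really DoubleIntervals, and the width of each interval
// is an upper estimate of the accumulated error.
void check_root(DoubleInterval root, double tolerance) {
    assert(root.to - root.from < tolerance);
}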

Conclusion

Floating-point computational error is measurable and manageable. By keeping it under control, you can make your clients happy and spare yourself nights of fruitless debugging.