
Yet another floating-point tutorial

Why though?

I know, I know, the topic is already covered by plenty of excellent tutorials and explanations.

But I think that reexplaining some obscure concepts with different words (and buttons) might still help someone understand them better. And in the end, make fewer mistakes, write faster code, and create better software in general.

So I wrote this tutorial. It has C++ code samples, clickable models, and even quests and puzzles. I used everything, every trick I know to explain things in the clearest way possible. If you think your understanding of floating-point computations still could be improved after reading all the other tutorials, this page should help you.

Let's get started, shall we?

But first! Meet the 6-bit integer number

Floating-point numbers were invented to represent real numbers. Like 0.5 or 3.1415926. The ones that can't be represented by integers like 1 or 3. But in fact, floating-point numbers are made of integers, so you should probably learn them first.

At the very lowest level, all computers do is crunch integers. And only some finite subset of them. Computers operate on bit tuples, and there are only as many different combinations of bits as 2ⁿ, where n is the length of the tuple, and these combinations are your numbers.

If you have 1 bit, you can only have two numbers: 0 and 1. If you have 2 bits, there are 0, 1, 2, and 3. If you have 8 bits, you have 256 different numbers, if 16 — 65 536, 32 — 4 294 967 296, and so on.
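
You can see these limits from C++ directly. A quick sketch (assuming <cstdint>, <limits>, and <iostream> are included):

  // 8-bit tuples make 2^8 = 256 combinations, so the largest unsigned
  // 8-bit number is 255
  std::cout << unsigned(std::numeric_limits<uint8_t>::max()) << '\n';
  // 16-bit tuples make 2^16 = 65 536 combinations: 0 to 65535
  std::cout << std::numeric_limits<uint16_t>::max() << '\n';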

Since 2-bit tuples provide too few numbers, and 64-bit ones — far too many, here's a nice and compact model of a 6-bit integer number.

The number-buttons are clickable.

Quest 1. Please make the model show 31.

Integer numbers are not universally standardized, but there are some common conventions. For instance, the most common way to represent negative integers is by using the higher half of the range.

This means that by constantly incrementing the model, at some point you'll see: ..., 29, 30, 31, -32, -31, -30, ...
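
The same jump is easy to reproduce with real 8-bit integers. Here's a sketch, assuming <cstdint> and <iostream>; note that converting an out-of-range value to a signed type is only guaranteed to wrap like this since C++20 (before that, the result is implementation-defined, although two's-complement platforms agree on it):

  uint8_t u = 128;  // the first number of the higher half of the 8-bit range
  int8_t s = static_cast<int8_t>(u);
  std::cout << int(s);  // prints -128, the smallest negative 8-bit number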

Quest 2. Please make the model show -32.

It looks a little odd when you first meet this jump, but it all makes perfect sense once you see the whole range in one picture. Let's pick an even simpler model. The 1-digit decimal number will do.

Unsigned 1-digit decimal | Signed 1-digit decimal
                       0 |  0
                       1 |  1
                       2 |  2
                       3 |  3
                       4 |  4
                       5 | -5
                       6 | -4
                       7 | -3
                       8 | -2
                       9 | -1

The first half of the range is shared between signed and unsigned types. Then there's a leap back for the signed ones. After that, they diverge by exactly half a range.

If you take overflows into consideration, it starts to seem natural. Yes, signed numbers overflow at half-range, but unsigned numbers overflow too, only half a range later. It's like two continuous rolls of numbers going along each other.

Unsigned 1-digit decimal | Signed 1-digit decimal
                     ... | ...
                       9 | -1
                       0 |  0
                       1 |  1
                       2 |  2
                       3 |  3
                       4 |  4
                       5 | -5
                       6 | -4
                       7 | -3
                       8 | -2
                       9 | -1
                       0 |  0
                       1 |  1
                       2 |  2
                       3 |  3
                       4 |  4
                       5 | -5
                     ... | ...

Please don't write your code expecting this behavior, though. Integer overflows are not universally standardized. And even if they work properly for your case, exploiting overflows makes code obscure, error-prone, and often non-portable.

Puzzle 1. What will this C++ program print out?

This puzzle requires significant C++ knowledge. If you're not familiar with the language, feel free to guess.

  // built with Clang 3.8 on Intel Core i7
  uint16_t a = 50'000;
  uint16_t b = 50'000;
  uint64_t c = a*b;
  std::cout << c;

With a proper convention, integer numbers can represent some non-integer values. For instance, 0.45 of a meter is 45 centimeters. And 45.6 centimeters is 456 millimeters.


456 mm = 45.6 cm = 0.456 m

It's just a matter of representation. It's all about where to put the decimal point.

Speaking of which, meet the 6-bit fixed-point number

With things like price tags or body temperature, we know where the decimal point goes. These things can be manipulated with the decimal point kept in mind.


$ 9.99

36.6°C

Just like that, we can turn an integer number into a fixed-point number by assigning a place for the binary point. In this model, the binary point goes after two bits, so the number is still an integer, but now it counts quarters, and it can represent a subset of real numbers such as 1.5.

Fixed-point numbers are still integers, but they are integer numbers of fractions.

Quest 3. Please make the model show -8.

You can add and subtract them like regular integers, but you have to introduce more elaborate rules for multiplication and division. And you have to watch for overflows and underflows accordingly.
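
To see why multiplication needs a special rule: a number of quarters times a number of quarters is a number of sixteenths, so the raw product has to be shifted back. A minimal sketch, following the 2-fractional-bit convention of the model above:

  // with 2 fractional bits, the integer x stands for x/4
  int a = 6;   // represents 6/4  = 1.5
  int b = 10;  // represents 10/4 = 2.5
  // addition works just like with regular integers: 6 + 10 = 16, i. e. 4.0
  int sum = a + b;
  // but the raw product measures sixteenths, so we shift the point back:
  // 6 * 10 = 60 sixteenths = 15 quarters, i. e. 3.75
  int product = (a * b) >> 2;
  std::cout << sum << ' ' << product;  // prints 16 15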

Sometimes fixed-point numbers are fine, but they can get cumbersome when there's a lot of computation. With a fixed tuple size, you have to choose where to put the binary point so that every arithmetic operation loses as little precision as possible. Wouldn't it be great if something did that for you?

And finally, meet the 6-bit floating-point number

Floating-point numbers are as popular as they are because they work fine even if you don't think about where the decimal point is, or where it will appear after the computation.

They come from scientific (also known as exponential) notation. With this notation, you can compactly write both astronomical and subatomic values.


m_Sun: 1.989e+30 kg
Mass of the Sun, which is 1989000000000000000000000000000 kg

m_e: 9.1093897e-31 kg
Mass of an electron, which is 0.0000000000000000000000000000091093897 kg

Just like a number in scientific notation, a floating-point number has a sign, some meaningful digits, and an exponent.

The three major differences are:

  1. it's all binary, not decimal;
  2. the exponent value goes before the meaningful digits;
  3. and since it's binary, and all the numbers except for 0 start with 1, you don't even have to write down the first 1 most of the time. It will be there implicitly. Saves you a whole bit! (The sketch right below shows all three parts on a real 32-bit float.)
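
Here's that sketch: peeking at the sign, the exponent, and the mantissa of a 32-bit float through its bits. It assumes <cstring>, <cstdint>, and <iostream> are included:

  float f = -0.15625f;  // that's -1.01 (binary) * 2^-3, representable exactly
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);        // reinterpret the float's bits
  uint32_t sign     =  bits >> 31;            // 1 bit
  uint32_t exponent = (bits >> 23) & 0xff;    // 8 bits, stored with a +127 bias
  uint32_t mantissa =  bits & 0x7fffff;       // 23 bits, without the implicit 1
  std::cout << sign << ' ' << exponent << ' ' << mantissa;
  // prints 1 124 2097152: the sign is minus, the exponent is 124-127 = -3,
  // and the mantissa bits 0100...0 mean 1.25 with the implicit 1 restored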

Quest 4. Please make the model show 1.

Only for the very small numbers, when we'd want the exponent to be even smaller than the format affords, do we omit the implied 1. In scientific notation, it's like writing 0.00123e-45: the number is not normalized. So we call these numbers subnormal, or denormalized, numbers.

Quest 5. Please make the model show 0.25.

Since you're still working with bits, and every number is still just an integer in disguise, you can imagine a floating-point number as an integer number of 2ⁿs.
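
This integer-in-disguise nature is easy to observe: around any given value, neighboring floats sit exactly one 2ⁿ apart. A sketch, assuming <cmath> and <iostream>:

  // the next representable float after 1 is exactly 2^-23 away,
  // since a 32-bit float carries 23 explicit mantissa bits
  std::cout << std::nextafter(1.f, 2.f) - 1.f;  // prints 1.19209e-07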

Quest 6. Please make the model show 2.

Unlike integer numbers, floating-point numbers are standardized, and the standard specifies several useful conventions. The whole range of possible bit combinations holds not only regular numbers but also two distinct zeros, two infinities, and a whole subrange of “not a number”s.

To understand what the special infinite values are for, you should probably understand floating-point zeros first.

They are not actual zeros. Instead, they model anything that is smaller in absolute value than the smallest representable number. Of course, they may occasionally represent an actual zero, but there is no way to tell whether they do. So they should be treated as very small numbers rather than true zeros. And as such, they deserve to retain their signs.

  auto min_float
    = std::numeric_limits<float>::denorm_min();
  std::cout << min_float;
  // prints 1.4013e-45

  std::cout << min_float / 2.f;
  // prints 0

  std::cout << -min_float / 2.f;
  // prints -0

The smallest representable number is the smallest denormalized number. One half of it is semantically a number, but pragmatically a zero.

Denormalized values are not universally available. If you don't need the extra precision they provide near zero, you can win some performance by turning them off.
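
How to turn them off depends on the platform. On x86, for instance, there is a control-register flag for that, and some compilers will set it for you as part of their fast-math modes. A sketch of doing it by hand (assumes an x86 target and <xmmintrin.h>; it only affects SSE arithmetic in the current thread):

  // from here on, operations that would produce a denormal produce 0 instead
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);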

Speaking of performance, some compiler optimizations are algebraic. This means that the compiler may simplify floating-point expressions as if they were computed on real numbers. Usually, that's not a problem. Sometimes it is.

Puzzle 2. What will this C++ program print out?

  std::cout << 0 - (min_float / 2.f);

When you divide a non-zero number by zero, you get infinity. And just as floating-point zeros are not really zero, these are not really infinite. They stand for numbers that are too big to be represented.
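
Both roads to infinity are easy to take (assuming IEEE 754 floats, <limits>, and <iostream>):

  // dividing a non-zero number by zero gives an infinity...
  std::cout << 1.f / 0.f << '\n';  // prints inf
  // ...and so does overflowing the largest representable float
  std::cout << std::numeric_limits<float>::max() * 2.f;  // prints inf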

There are also numbers that can't possibly be represented in this model at all, like square roots of negative numbers. Technically, they are still numbers, but being complex and not real, they have no representation here. Operations that take numbers as input but can't provide a number as output return the special “not a number” values instead.
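
A sketch of that, too, assuming <cmath> and <iostream>. Note that a NaN doesn't even equal itself, which is one way to detect it:

  // the square root of -1 is not a real number, so we get a NaN
  std::cout << std::sqrt(-1.f) << '\n';  // prints nan (or -nan)
  // NaN compares unequal to everything, including itself
  std::cout << (std::sqrt(-1.f) == std::sqrt(-1.f));  // prints 0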

Representation error

Of course, the numbers that do get represented as floating-point numbers are usually represented with an error. We only have so many bits and so many combinations, while there are infinitely many real numbers. You can't squeeze an infinite range into a finite set.

When a number doesn't have an exact representation, we pick the nearest number that does instead.

The difference between their values is our representation error.

Quest 7. Using the dial, please enter a number with the error of exactly 0.001.

A common misconception is: since our smallest representable numbers are small, the representation error should be small too. But remember, we are talking about integer numbers of 2ⁿs. The greater the n, the greater the possible representation error.
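
Here's how this looks with real 32-bit floats: near 1e8, the neighboring floats are already 8 apart, so adding 1 does nothing at all.

  // 100 000 000 is representable exactly, but its neighbors are 8 apart,
  // so adding 1 rounds right back to 100 000 000
  std::cout << (100000000.f + 1.f == 100000000.f);  // prints 1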

Quest 8. Using the dial, please enter a number with the error of 1000.

Computational error

Still, the representation error is not our worst enemy. Any real-world number we collect from a sensor, or a user interface, has its own error anyway. And it usually dwarfs the representation error of floating-point numbers.

However, as we process the data, we introduce computational error, too.

Here it is in our tiny 6-bit model of floating-point addition:

Computational error is not inevitable, though. If the arguments' exponents are close enough, the result of an operation may be representable without any error at all.

Puzzle 3. That thing with the combo-boxes lets you pick 16 variants of 6-bit float addition. How many of them cause no error?

There is a common belief that comparing two floats exactly is unsafe because of the possible error, but this error is often manageable. For instance, this loop is quite safe (remember, floats are just integer numbers of 2ⁿs).

  // 16 iterations loop
  for(auto i = 0.; i != 4.; i += 0.25)
      std::cout << i << ' ';

But it might be hard to tell safe from unsafe. For instance, this one isn't safe, since 0.1 is only represented in the floating-point model with an error.

  // "infinite" loop
  for(auto i = 0.; i != 4; i += 0.1)
      cout << i << ' ';
	

The computational error tends to accumulate as the computation goes on. Of course, it depends on the algorithm: some are more prone to accumulating errors than others.
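
A sketch of this accumulation with 32-bit floats: every addition rounds, and ten thousand roundings add up to a visible error.

  float sum = 0.f;
  for(int i = 0; i < 10000; ++i)
      sum += 0.01f;  // 0.01 is stored with a tiny error, and every += rounds
  std::cout << sum;  // prints a number close to, but not exactly, 100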

What's worse, this error is hard to estimate. Unless you deliberately put some effort into error analysis, you can never be sure how much of your computation is real data and how much is error.

Ok, I think we have the basics covered

I hope that the tutorial was fun and educational. I tried to keep it short, though, so it only covers the essentials. For the specific parts, I prepared a few more explanations.

If you want to learn more about how to estimate the computational error, please see Estimating floating-point error the easy way.

If you want to learn how to encode your error messages in “not a number”, and also why, and why not, please visit Error codes are not numbers. But they are. Can we exploit that?

And if you want to see what else you could use to represent real numbers, here's Yet another alternative to floating-point numbers.