This is a buffet of redundant stories. I provide several pieces, and you choose which of them to read and share, and in which order. Every piece contains the same thought, only put differently, so don't be afraid of missing anything. Experiment and enjoy!
Component redundancy is used heavily in safety-critical and mission-critical systems for reliability improvement. But outside this niche, it's surprisingly little known in the world of software. Which is a shame since it's a simple but economical idea. It costs nothing to keep in mind, and it saves you a lot on hotfixes and emergency repairs.
Redundancy, however, only makes sense if you have unreliable components, and software components are theoretically infinitely reliable. It is only in practice that they aren't.
Let's say you want to keep your reactor core temperature under control. You put a thermal sensor there, wire it out to some kind of SCADA, and there you go. Except, since it's very important, you can't rely on just one wire. What if, for whatever reason, it snaps?
So you add another sensor and another wire. It's now more reliable. But what if one of the sensors shows 300 degrees Celsius and the other 1234? Should you shut down the reactor or just replace the broken sensor?
Well, you add yet another sensor and another wire. Now it's either 300, 300, 1234 or 1234, 1234, 300. Two readings out of three agree, so now you can make a decision.
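Just to make the vote concrete, here is a minimal sketch of the 2-out-of-3 logic in code. The function name and the bare array of readings are my own assumptions, not anything a real SCADA exposes.

    #include <algorithm>
    #include <array>

    // 2-out-of-3 voting: with three redundant sensors, the median reading
    // always sides with the majority, so a single broken sensor can not
    // skew the result: 300, 300, 1234 -> 300.
    double vote_2_out_of_3(std::array<double, 3> readings) {
        std::sort(readings.begin(), readings.end());
        return readings[1];  // the median of the three values
    }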
In Ukraine, up until the early 90s, components like these were triplicated at all the nuclear power plants. But then the collaboration with the IAEA started, and it brought up a new question: let's say you know that you need to replace the broken sensor. How can you be sure you're getting adequate data while you're replacing it?
The rest of the world was already quadruplicating their components, but the Ukrainian NPP industry didn't yet have a reliable quadruplicator — a special device to bring four signals together. Developing and producing this device under the strictest reliability requirements would have been too risky at that point, so they decided to duplicate a triplicator instead.
Let that sink in. Duplicating the proven triplicator was considered a better option than introducing a completely new component into the system.
Now, what can we, as software engineers, learn from that?
Well, nothing. Because we somehow presume that software components are flawless. They are, of course, not, but this is the model we chose to believe in. A diode can fail, a resistor can burn, a capacitor can leak, but a hello-world is a hello-world forever and ever.
If software is flawless, then redundancy brings no value. But empirically, it does. To understand why it is so appreciated in the world of safety-critical development, and to learn to use it to our advantage, we first have to abandon the notion of software infallibility.
In the world of inherently unreliable components, a.k.a. the real world, introducing redundancy is the only realistic way to improve your system's reliability. It's simple math.
Let's say you have 3 components. Each has a 10% chance of failure and a 90% chance to work. If you wire them in sequence, so that every one of them has to work, the reliability of this subsystem would be:
0.9×0.9×0.9 = 0.729
Now, if you have the same low-reliability components but wire them in parallel, so that the working ones can substitute for the failed ones, then the reliability of such a subsystem would be:
1.0 - (0.1×0.1×0.1) = 0.999
Adding unreliable components in a sequence reduces your system's reliability little by little. But duplicating unreliable components improves your system's reliability drastically. That's how redundancy works.
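Here is the same arithmetic as a tiny program, if only to make the contrast explicit. The 0.9 is still the made-up figure from above.

    #include <cstdio>

    int main() {
        const double r = 0.9;  // reliability of a single component

        // In sequence, every component has to work.
        const double in_sequence = r * r * r;                               // 0.729

        // In parallel, at least one component has to work.
        const double in_parallel = 1.0 - (1.0 - r) * (1.0 - r) * (1.0 - r); // 0.999

        std::printf("in sequence: %.3f\n", in_sequence);
        std::printf("in parallel: %.3f\n", in_parallel);
    }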
Hardware guys, especially those who burned enough resistors, realize that. That’s why they duplicate and triplicate anything worth triplicating. We don’t.
Because in software, we assume that the code is always defect-free or at least deterministic. There is a whole field of study that disproves the former and a lot of anecdotal evidence to contradict the latter, but the general notion is: programs are not susceptible to wear and decay, and that alone makes them ultimately reliable.
Of course, if your reliability is measured as 1, then you can multiply it however you like; you'll still get 1. Redundancy is then pointless. It only makes sense if we admit that software is inherently just as fault-prone as everything else in the world.
And it does. And it is.
I had to build a very cool thing I can't tell much about for legal reasons. I can tell you about its build process, though. It was supposed to be a CUDA thing wired into C++ code built with CMake and running on Linux. The build instruction, actually a Dockerfile, was explicit about versions, but only to the point at which it works in Docker. I wanted to build the thing on WSL, and this brought in enough uncertainty to make the build system crumble.
Here's what I found out that weekend. For some reason, CMake versions 3.16 and higher don't bootstrap on GCC 5 to 7, but only on WSL2. On WSL 1, they do, but CUDA doesn't see your GPU. The target architecture for CUDA is set differently in CMake 3.17 and CMake 3.18. And clang-tools and libclang can, but really shouldn't, belong to different versions of Clang.
There were also troubles with different C++ standards and dialects. The most annoying, the most trivial, and the most unnecessary ones. Like, on MSVC you can get away with putting messages into plain std::exceptions; on GCC, you cannot. In C++17, there are messageless static_asserts, but in C++14, every static_assert has to be supplied with a message string.
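For instance, here is a minimal sketch of the static_assert difference. The asserted condition itself is arbitrary; it's the message string that matters.

    #include <type_traits>

    // C++14 and earlier: the message string is mandatory.
    static_assert(std::is_trivially_copyable<int>::value,
                  "int must be trivially copyable");

    #if __cplusplus >= 201703L
    // C++17 and later: the message may be omitted.
    static_assert(std::is_trivially_copyable<int>::value);
    #endif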
I spent the weekend trying different versions of things until, finally, it all worked.
The build process was all too sequential: a failure in any subsystem caused the whole build to fail. And if you have ever tried CMake, CUDA, and cross-platform C++ together, you know how fragile these things are. You can't build a reliable sequential system out of unreliable subsystems. It's simple math.
But in fact, the build process had plenty of redundancy within. It's just that I had to manage it manually.
When CMake 3.17 didn't work, I tried 3.18. When Clang 10 wasn't enough, I installed Clang 11. When CUDA 11 requested a -std=c++17 option, I added it. This variability made the build possible, both mathematically and practically.
Come to think of it, this is just insane. You don't need a person to switch connectors in a nuclear core. In fact, you don't want this person there. And the person doesn't want to be there. So electrical engineers found a way to automate this. But we, software engineers, who should be ahead in any kind of automation, are still switching subsystems manually.
We can, of course, introduce redundancy to our systems as well. But this will require a shift in perception. We would have to admit that software is just as fault-prone as hardware. A build can eat up your disk space, a package can be unavailable because of network issues, even a compiler can miscompile.
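If we did admit that, managing the redundancy could be automated, just like the hardware people automated theirs. Here is only a sketch of the idea, with made-up names and nothing resembling a real build tool: try the interchangeable alternatives in order and settle for the first one that works.

    #include <functional>
    #include <vector>

    // Automated redundancy: interchangeable alternatives, say different
    // toolchain versions, are tried in order; the first one that works wins.
    // The whole subsystem fails only if every alternative fails.
    bool first_that_works(const std::vector<std::function<bool()>>& alternatives) {
        for (const auto& attempt : alternatives)
            if (attempt())
                return true;
        return false;
    }

With the arithmetic from before, a chain of three such 0.9-reliable alternatives fails only when all three do.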
Programs are just as faulty as resistors and capacitors. It's just a matter of scale. Sure, print "hello world" is reliable enough. But one defect for every two thousand lines of code is the industry norm these days. Software subsystems are evidently fault-prone.
So why do we keep building our systems as if they aren't?
In safety-critical applications, every component is considered fault-prone. And if you acknowledge that your components are fault-prone, you have no other option but to think of a way to juggle them for the best possible outcome. Redundancy is not a silver bullet, but it makes building a reliable system out of unreliable components possible.
If we want to build reliable systems, if we even want to start considering reliability, first we have to acknowledge that our components aren't reliable.