This is a buffet of redundant stories. I provide several pieces, and you choose which of them to read and share, and in which order. Every piece contains the same thought, only put differently, so don't be afraid of missing anything. Experiment and enjoy!
Component redundancy is used heavily in safety-critical and mission-critical systems for reliability improvement. But outside this niche, it's surprisingly little known in the world of software. Which is a shame since it's a simple but economical idea. It costs nothing to keep in mind, and it saves you a lot on hotfixes and emergency repairs.
Redundancy, however, only makes sense if you have unreliable components, and software components are theoretically infinitely reliable. It is only in practice that they aren't.
Let's say you want to keep your reactor core temperature under control. You put a thermal sensor there, wire it out to some kind of SCADA, and there you go. Except, since it's very important, you can't rely on just one wire. What if, for whatever reason, it snaps?
So you add another sensor and another wire. It's now more reliable. But what if one of the sensors shows 300 degrees Celsius and the other 1234? Should you shut down the reactor or just replace the broken sensor?
Well, you add yet another sensor and another wire. Now it's either 300, 300, 1234 or 1234, 1234, 300. Two readings out of three agree, so now you can make a decision.
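Just to make the vote concrete, here is a minimal sketch of the 2-out-of-3 logic in code. The function name and the bare array of readings are my own assumptions, not anything a real SCADA exposes.

    #include <algorithm>
    #include <array>

    // 2-out-of-3 voting: with three redundant sensors, the median reading
    // always sides with the majority, so a single broken sensor can not
    // skew the result: 300, 300, 1234 -> 300.
    double vote_2_out_of_3(std::array<double, 3> readings) {
        std::sort(readings.begin(), readings.end());
        return readings[1];  // the median of the three values
    }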
In Ukraine, up until the early 90s, components like these were triplicated at all the nuclear power plants. But then the collaboration with the IAEA started, and it brought up a new question: let's say you know that you need to replace the broken sensor. How can you be sure you're getting adequate data while you're replacing it?
The rest of the world was already quadruplicating their components, but the Ukrainian NPP industry didn't yet have a reliable quadruplicator — a special device to bring four signals together. Developing and producing this device under the strictest reliability requirements would have been too risky at that point, so they decided to duplicate a triplicator instead.
Let that sink in. Duplicating the proven triplicator was considered a better option than introducing a completely new component into the system.
Now, what can we, as software engineers, learn from that?
Well, nothing. Because we somehow presume that software components are flawless. They are, of course, not, but this is the model we chose to believe in. A diode can fail, a resistor can burn, a capacitor can leak, but a hello-world is a hello-world forever and ever.
If software is flawless, then redundancy brings no value. But empirically, it does. To understand why it is so appreciated in the world of safety-critical development, and to learn to use it to our advantage, we first have to abandon the notion of software infallibility.
In the world of inherently unreliable components, a.k.a. the real world, introducing redundancy is the only realistic way to improve your system's reliability. It's simple math.
Let's say you have 3 components. Each has a 10% chance of failure and a 90% chance to work. If you wire them in sequence, so that every one of them has to work, the reliability of this subsystem would be:
0.9×0.9×0.9 = 0.729
Now, if you have the same low-reliability components but wire them in parallel, so that the working ones can substitute for the failed ones, then the reliability of such a subsystem would be:
1.0 - (0.1×0.1×0.1) = 0.999
Adding unreliable components in a sequence reduces your system's reliability little by little. But duplicating unreliable components improves your system's reliability drastically. That's how redundancy works.
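Here is the same arithmetic as a tiny program, if only to make the contrast explicit. The 0.9 is still the made-up figure from above.

    #include <cstdio>

    int main() {
        const double r = 0.9;  // reliability of a single component

        // In sequence, every component has to work.
        const double in_sequence = r * r * r;                               // 0.729

        // In parallel, at least one component has to work.
        const double in_parallel = 1.0 - (1.0 - r) * (1.0 - r) * (1.0 - r); // 0.999

        std::printf("in sequence: %.3f\n", in_sequence);
        std::printf("in parallel: %.3f\n", in_parallel);
    }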
Hardware guys, especially those who burned enough resistors, realize that. That’s why they duplicate and triplicate anything worth triplicating. We don’t.
Because in software, we assume that the code is always defect-free or at least deterministic. There is a whole field of study that disproves the former and a lot of anecdotal evidence to contradict the latter, but the general notion is: programs are not susceptible to wear and decay, and that alone makes them ultimately reliable.
Of course, if your reliability is measured as 1, then you can multiply it however you like; you'll still get 1. Redundancy is then pointless. It only makes sense if we admit that software is inherently just as fault-prone as everything else in the world.
And it does. And it is.
I had to build a very cool thing I can't tell much about for legal reasons. I can tell you about its build process, though. It was supposed to be a CUDA thing wired into C++ code built with CMake and running on Linux. The build instruction, actually a Dockerfile, was explicit about versions, but only to the point at which it works in Docker. I wanted to build the thing on WSL, and this brought in enough uncertainty to make the build system crumble.
Here's what I found out that weekend. For some reason, CMake versions 3.16 and higher don't bootstrap on GCC 5 to 7, but only on WSL2. On WSL 1, they do, but CUDA doesn't see your GPU. The target architecture for CUDA is set differently in CMake 3.17 and CMake 3.18. And clang-tools and libclang can, but really shouldn't, belong to different versions of Clang.
There were also troubles with different C++ standards and dialects. The most annoying, the most trivial, and the most unnecessary ones. Like, on MSVC you can get away with putting messages into plain std::exceptions; on GCC, you cannot. In C++17, there are messageless static_asserts, but in C++14, every static_assert has to be supplied with a message string.
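For instance, here is a minimal sketch of the static_assert difference. The asserted condition itself is arbitrary; it's the message string that matters.

    #include <type_traits>

    // C++14 and earlier: the message string is mandatory.
    static_assert(std::is_trivially_copyable<int>::value,
                  "int must be trivially copyable");

    #if __cplusplus >= 201703L
    // C++17 and later: the message may be omitted.
    static_assert(std::is_trivially_copyable<int>::value);
    #endif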
I spent the weekend trying different versions of things until, finally, it all worked.
The build process was all too sequential: a failure in any subsystem caused the whole build to fail. And if you have ever tried CMake, CUDA, and cross-platform C++ together, you know how fragile these things are. You can't build a reliable sequential system out of unreliable subsystems. It's simple math.
But in fact, the build process had plenty of redundancy within. It's just that I had to manage it manually.
When CMake 3.17 didn't work, I tried 3.18. When Clang 10 wasn't enough, I installed Clang 11. When CUDA 11 requested a -std=c++17 option, I added it. This variability made the build possible, both mathematically and practically.
Come to think of it, this is just insane. You don't need a person to switch connectors in a nuclear core. In fact, you don't want this person there. And the person doesn't want to be there. So electrical engineers found a way to automate this. But we, software engineers, who should be ahead in any kind of automation, are still switching subsystems manually.
We can, of course, introduce redundancy to our systems as well. But this will require a shift in perception. We would have to admit that software is just as fault-prone as hardware. A build can eat up your disk space, a package can be unavailable because of network issues, even a compiler can miscompile.
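If we did admit that, managing the redundancy could be automated, just like the hardware people automated theirs. Here is only a sketch of the idea, with made-up names and nothing resembling a real build tool: try the interchangeable alternatives in order and settle for the first one that works.

    #include <functional>
    #include <vector>

    // Automated redundancy: interchangeable alternatives, say different
    // toolchain versions, are tried in order; the first one that works wins.
    // The whole subsystem fails only if every alternative fails.
    bool first_that_works(const std::vector<std::function<bool()>>& alternatives) {
        for (const auto& attempt : alternatives)
            if (attempt())
                return true;
        return false;
    }

With the arithmetic from before, a chain of three such 0.9-reliable alternatives fails only when all three do.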
Programs are just as faulty as resistors and capacitors. It's just a matter of scale. Sure, print "hello world" is reliable enough. But one defect for every two thousand lines of code is the industry norm these days. Software subsystems are evidently fault-prone.
So why do we keep building our systems as if they aren't?
In safety-critical applications, every component is considered fault-prone. And if you acknowledge that your components are fault-prone, you have no other option but to think of a way to juggle them for the best possible outcome. Redundancy is not a silver bullet, but it makes building a reliable system out of unreliable components possible.
If we want to build reliable systems, if we even want to start considering reliability, first we have to acknowledge that our components aren't reliable.