Comment by colinb
3 months ago
> code for radiation hardened environments
I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?
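For anyone unfamiliar with the "unreasonable value" approach mentioned above, here is a minimal sketch of a rate-of-change plausibility check. The function name, the parameters, and the idea of a known maximum rate are assumptions made for illustration, not something taken from a specific flight codebase.

```python
def plausible(prev: int, new: int, elapsed_s: float, max_rate_per_s: float) -> bool:
    """Return True if a monotonic counter moved by a believable amount.

    Assumption: the counter never decreases and cannot physically grow
    faster than max_rate_per_s.
    """
    delta = new - prev
    return 0 <= delta <= max_rate_per_s * elapsed_s


# A flipped high bit makes the counter jump far beyond the allowed rate.
reading = 100
corrupted = reading | (1 << 20)
print(plausible(reading, 110, 1.0, 50.0))        # small step: accepted
print(plausible(reading, corrupted, 1.0, 50.0))  # huge jump: rejected
```

This only catches flips that push a value outside its believable range; a flipped low bit passes unnoticed, which is why the redundancy schemes in the replies exist.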
For safety-critical systems, one strategy is to store at least two copies of important data and compare them regularly. If they don't match, you either try to recover somehow or go into a safe state, depending on the context.
At least three copies, so you can recover based on consensus.
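A minimal sketch of the three-copy idea, assuming the value fits in an integer and that upsets hit at most one copy between reads. `TripleStored` and `majority_vote` are hypothetical names used only for this example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TripleStored:
    """Keeps three copies of an integer and repairs them on every read."""

    def __init__(self, value: int) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        a, b, c = self.copies
        voted = majority_vote(a, b, c)
        # Scrub: rewrite all copies so a single upset does not linger
        # until a second one hits a different copy.
        self.copies = [voted, voted, voted]
        return voted


counter = TripleStored(0x1234)
counter.copies[1] ^= 0x0040  # simulate a single bit flip in one copy
print(hex(counter.read()))
```

Voting per bit rather than per value means the result is still correct when two copies are damaged, as long as they are damaged in different bit positions.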
If your pieces of important data are very tiny, that's probably your best option.
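If they're hundreds of bytes or more, then two copies plus two hashes will do a better job: a hash mismatch tells you which copy is damaged, and two hashes cost less storage than a third full copy.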
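A rough sketch of the two-copies-plus-two-hashes layout, assuming SHA-256 from the standard library as the hash (a CRC would be a more typical choice on a small target). `DualHashed` is a hypothetical name for this example.

```python
import hashlib


class DualHashed:
    """Two copies of a blob, each stored next to its own digest."""

    def __init__(self, data: bytes) -> None:
        digest = hashlib.sha256(data).digest()
        self.slots = [[bytearray(data), digest], [bytearray(data), digest]]

    def read(self) -> bytes:
        for data, digest in self.slots:
            if hashlib.sha256(data).digest() == digest:
                good = bytes(data)
                # Repair both slots from the copy that verified.
                self.slots = [[bytearray(good), digest], [bytearray(good), digest]]
                return good
        # Neither copy verifies: the caller has to fall back to a safe state.
        raise RuntimeError("both copies failed their hash check")


blob = DualHashed(b"x" * 200)
blob.slots[0][0][5] ^= 0x01  # simulate a bit flip in the first copy
print(blob.read() == b"x" * 200)
```

A flip in a stored digest is handled the same way as a flip in the data: that slot fails verification and is rebuilt from the other one.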
In many cases the system is perfectly safe when it shuts off. Two is enough for that.
“never go to sea with two chronometers, take one or three”
I use ZFS even on consumer devices, these days. Parity checks all the way!
You can have voting systems in place, where at least 2 out of 3 different code paths have to produce the same output for it to be accepted. This can be done with multiple systems (by multiple teams/vendors) or more simply with multiple tries of the same path, provided you fully reload the input in between.
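A minimal sketch of the "multiple tries of the same path" variant, assuming the computation is deterministic and its result is hashable. `vote`, `load_input`, and `compute` are hypothetical names; the key point is that the input is reloaded before every run.

```python
from collections import Counter
from typing import Callable, Hashable, TypeVar

T = TypeVar("T")
R = TypeVar("R", bound=Hashable)


def vote(load_input: Callable[[], T], compute: Callable[[T], R],
         runs: int = 3, quorum: int = 2) -> R:
    """Run compute several times on freshly loaded input; accept the majority result."""
    results = [compute(load_input()) for _ in range(runs)]
    value, count = Counter(results).most_common(1)[0]
    if count < quorum:
        raise RuntimeError("no quorum among redundant runs")
    return value


# Simulated transient fault: the second run returns a wrong answer.
calls = {"n": 0}

def flaky_double(x: int) -> int:
    calls["n"] += 1
    return x * 2 + (1 if calls["n"] == 2 else 0)

print(vote(lambda: 21, flaky_double))
```

Repeating the same path only helps against transient faults. A bug in the code produces the same wrong answer three times, which is the case the multi-team, multi-vendor variant is meant to cover.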
The simplest one is a watchdog: if something stops sending its regular notifications, restart it.
A watchdog guards against unresponsive software. It doesn't protect against bad data directly. Not all bad data makes a system freeze.
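To make the distinction concrete, here is a minimal software watchdog sketch. It assumes a polling design with an injectable clock; real systems usually rely on a hardware timer, and the `Watchdog` class here is hypothetical.

```python
import time
from typing import Callable


class Watchdog:
    """Calls on_expire if kick() has not been called within timeout_s."""

    def __init__(self, timeout_s: float, on_expire: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.clock = clock
        self.last_kick = clock()

    def kick(self) -> None:
        """Heartbeat from the supervised task."""
        self.last_kick = self.clock()

    def poll(self) -> bool:
        """Check the deadline; return True if the watchdog fired."""
        if self.clock() - self.last_kick > self.timeout_s:
            self.on_expire()
            self.last_kick = self.clock()  # re-arm after the restart action
            return True
        return False
```

Nothing in this code looks at the data the task produces. A task that keeps calling `kick()` while working on corrupted values never trips it, so a watchdog complements the redundancy checks above rather than replacing them.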