Comment by colinb

3 months ago

> code for radiation hardened environments

I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?

For safety-critical systems, one strategy is to store at least two copies of important data and compare them regularly. If they don't match, you either try to recover somehow or go into a safe state, depending on the context.
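To make the two-copies idea concrete, here is a minimal sketch in C. The `guarded_u32` type and the `guarded_write`/`guarded_read` helpers are names I made up for illustration, not from any particular library, and what the caller does on a mismatch (recover or go to a safe state) is left out on purpose.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guarded variable: the same value is stored twice. */
typedef struct {
    volatile uint32_t primary;
    volatile uint32_t shadow;
} guarded_u32;

/* Write the value to both copies. */
static void guarded_write(guarded_u32 *g, uint32_t value) {
    g->primary = value;
    g->shadow = value;
}

/* Returns true and stores the value in *out if both copies agree.
 * Returns false (leaving *out untouched) if they differ, which means
 * at least one copy was corrupted. */
static bool guarded_read(const guarded_u32 *g, uint32_t *out) {
    uint32_t a = g->primary;
    uint32_t b = g->shadow;
    if (a != b) {
        return false;
    }
    *out = a;
    return true;
}
```

With only two copies you can detect a mismatch but not tell which copy is the good one, so the false branch is where the recover-or-safe-state decision has to live.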

You can have voting systems in place, where at least 2 out of 3 different code paths have to produce the same output for it to be accepted. This can be done with multiple systems (by multiple teams/vendors) or more simply with multiple tries of the same path, provided you fully reload the input in between.
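The voting step itself is small. A rough sketch, assuming the three results have already been computed (by separate implementations, or by three runs with the input reloaded each time); `vote_2oo3` is just a name I picked:

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-out-of-3 voter: accepts a result only if at least two inputs agree.
 * Returns true and stores the majority value in *out, or returns false
 * if all three inputs differ. */
static bool vote_2oo3(uint32_t a, uint32_t b, uint32_t c, uint32_t *out) {
    if (a == b || a == c) {
        *out = a;
        return true;
    }
    if (b == c) {
        *out = b;
        return true;
    }
    return false; /* no majority: caller must recover or go to a safe state */
}
```

Unlike the two-copy comparison, a single bad result is outvoted here instead of only being detected.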

The simplest one is a watchdog: if a component stops sending its regular notifications, it gets restarted.
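A real watchdog is usually a hardware timer that resets the chip, but the logic can be sketched in software. This assumes a free-running 32-bit tick counter supplied by the caller; the `watchdog` struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software watchdog driven by a free-running tick counter. */
typedef struct {
    uint32_t timeout_ticks; /* maximum allowed gap between kicks */
    uint32_t last_kick;     /* tick value at the most recent kick */
} watchdog;

static void watchdog_init(watchdog *w, uint32_t now, uint32_t timeout_ticks) {
    w->timeout_ticks = timeout_ticks;
    w->last_kick = now;
}

/* Called by the monitored task as its regular "I'm alive" notification. */
static void watchdog_kick(watchdog *w, uint32_t now) {
    w->last_kick = now;
}

/* Called by the supervisor. Unsigned subtraction keeps the elapsed time
 * correct across a single wraparound of the tick counter. */
static bool watchdog_expired(const watchdog *w, uint32_t now) {
    return (uint32_t)(now - w->last_kick) > w->timeout_ticks;
}
```

The supervisor polls `watchdog_expired` and restarts the task when it returns true.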

  • A watchdog guards against unresponsive software. It doesn't protect against bad data directly. Not all bad data makes a system freeze.