Comment by jojomodding
17 hours ago
Designing around hardware failure in software seems somewhere between cumbersome and insane. If the CPU can end up executing arbitrary code because it jumps to some random location, no guarantees apply.
What you actually do here is estimate the probability of a cosmic-ray bit flip and then accept a certain failure probability. For things like train signals, the budget is on the order of one failure in a billion hours.
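To make the budgeting concrete, here is a back-of-the-envelope sketch; the upset rate, memory size, and budget are illustrative assumptions, not measured or certified figures.

```python
# Toy failure-budget arithmetic. Every constant here is an illustrative
# assumption, not a measured or certified figure.
UPSET_RATE_PER_BIT_HOUR = 1e-12   # assumed cosmic-ray upset rate per bit per hour
MEMORY_BITS = 8 * 2**30 * 8       # assumed 8 GiB of state, in bits
FAILURE_BUDGET_PER_HOUR = 1e-9    # "one failure in a billion hours"

expected_upsets_per_hour = UPSET_RATE_PER_BIT_HOUR * MEMORY_BITS
print(f"expected raw upsets per hour: {expected_upsets_per_hour:.3g}")

# Whatever mitigation is used (ECC, redundancy, watchdogs) has to push the
# rate of *undetected* failures below the budget.
required_suppression = expected_upsets_per_hour / FAILURE_BUDGET_PER_HOUR
print(f"mitigation must suppress failures by roughly a factor of {required_suppression:.1e}")
```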
> Designing around hardware failure in software seems somewhere between cumbersome and insane.
Yet for some reason you chose to post this comment over TCP/IP! And I'm guessing you loaded the browser you typed it into from an SSD that uses ECC. And probably earlier today you retrieved some data from GFS, for example by making a Google search. All three of those are instances of software designed around hardware failure; see the sketch after this comment.
But you must draw the line somewhere.
If "a cosmic ray could mess with your program counter, so you must model your program as if every statement may be followed by a random GOTO" sounds like a realistic scenario software verification should address, you will never be able to verify anything ever.
An approach that has been taken for hardware in space is to have 3 identical systems running at the same time.
Execution continues while all systems are in agreement.
If a cosmic ray causes a bit-flip in one of the systems, the system not in agreement with the other two takes on the state of the other two and continues.
If there is no agreement among the three systems, or execution ends up in an invalid state, all systems restart.
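That is the classic triple modular redundancy scheme. A minimal sketch of the voting step (hypothetical names, not actual flight software):

```python
from collections import Counter

def vote(states):
    """Majority-vote over three replica states.

    Returns (agreed_state, needs_restart): if at least two replicas agree,
    the odd one out adopts the majority state and execution continues;
    if all three disagree, everything restarts from a known-good state.
    """
    state, votes = Counter(states).most_common(1)[0]
    if votes >= 2:
        return state, False
    return None, True

# Replica 1 took a bit flip: it is outvoted and resynchronized.
print(vote([0b1010, 0b1011, 0b1010]))   # -> (0b1010, False)
# No agreement at all: restart all three systems.
print(vote([1, 2, 3]))                  # -> (None, True)
```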
> Designing around hardware failure in software seems somewhere between cumbersome and insane.
I mean, there are places where it makes sense, for example ZFS and filesystem checksums. If you've ever been bitten by a hard drive that says everything is fine but returns garbage, you'll appreciate it.
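The same idea in miniature (a toy block store, not how ZFS is actually implemented): keep a checksum next to each block and verify it on every read, so a drive that silently returns garbage gets caught instead of trusted.

```python
import hashlib

class ChecksummedStore:
    """Toy block store that detects silent corruption on read."""

    def __init__(self):
        self._blocks = {}  # block_id -> (data, sha256 digest)

    def write(self, block_id: int, data: bytes) -> None:
        self._blocks[block_id] = (data, hashlib.sha256(data).digest())

    def read(self, block_id: int) -> bytes:
        data, expected = self._blocks[block_id]
        if hashlib.sha256(data).digest() != expected:
            # ZFS would try to self-heal from a redundant copy here;
            # the important part is that the corruption is detected at all.
            raise IOError(f"checksum mismatch on block {block_id}")
        return data

store = ChecksummedStore()
store.write(0, b"important data")
assert store.read(0) == b"important data"
```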
Yet big sites like Google and TikTok deal with hardware failures every day while keeping their services and apps running.