Comment by codebje
21 hours ago
Specifications that are formally verified can definitely cover real-time guarantees, behaviour under error returns from operations like allocation, and similar things. Hardware failures can be accounted for in hardware verification, which is much like software verification: specification + hardware design = verified design; if the spec covers it, the verification guarantees it.
Considering software alone isn't pretty useless, nor is having the guarantee that "inc x = x - 1" will always go from an Int to an Int, even if it's not "fully right" at least trying to increment a string or a complex number will be rejected at compile time. Giving up on any improvements in the correctness of code because it doesn't get you all the way to 100% correct is, IMO, defeatist.
(Giving up on it because it has diminishing returns and isn't worth the effort is reasonable, of course!)
Hardware verification doesn't prevent hardware failures. There is a reason RAM comes with ECC. It's not because RAM designers are too lazy to do formal verification. Even with ECC RAM, bit flips can still happen if multiple bits flip at the same time.
There are also things like CPUs taking the wrong branch that occasionally happen. You can't assume that the hardware will work perfectly in the real world and have to design for failure.
Well of course hardware fails, and of course verification doesn't make things work perfectly. Verification says the given design meets the specification, assumptions and all. When the assumptions don't hold, the design shouldn't be expected to work correctly, either. When the assumptions do hold, formal verification says the design will work correctly (plus or minus errors in tools and materials).
We know dynamic RAM is susceptible to bit-flip errors. We can quantify the likelihood of it pretty well under various conditions. We can design a specification to detect and correct single bit errors. We can design hardware to meet that specification. We can formally verify it. That's how we get ECC RAM.
CPUs are almost never formally verified, at least not in full. Reliability engineering around systems too complex to verify, too expensive to engineer to never fail, or that might operate outside of the safe assumptions of their verified specifications, usually means something like redundancy and majority-rules designs. That doesn't mean verification plays no part. How do you know your majority-rules design works in the face of hardware errors? Specify it, verify it.
Designing around hardware failure in software seems cumbersome to insane. If the CPU can randomly execute arbitrary code because it jumps to wherever, no guarantees apply.
What you actually do here is consider the probability of a cosmic ray flip, and then accept a certain failure probability. For things like train signals, it's one failure in a billion hours.
> Designing around hardware failure in software seems cumbersome to insane.
Yet for some reason you chose to post this comment over TCP/IP! And I'm guessing you loaded the browser you typed it in from an SSD that uses ECC. And probably earlier today you retrieved some data from GFS, for example by making a Google search. All three of those are instances of software designed around hardware failure.
2 replies →
An approach that has been taken for hardware in space is to have 3 identical systems running at the same time.
Execution continues while all systems are in agreement.
If a cosmic ray causes a bit-flip in one of the systems, the system not in agreement with the other two takes on the state of the other two and continues.
If there is no agreement between all 3 systems, or the execution ends up in an invalid state, all systems restart.
>Designing around hardware failure in software seems cumbersome to insane
I mean there are places to do it. For example ZFS and filesystem checksums. If you've ever been bit by a hard drive that says everything is fine but returns garbage you'll appreciate it.
Yet, big sites like Google or TikTok constantly deal with hardware failures everyday while keeping their services and apps running.