Comment by jonathanstrange

1 day ago

No hardware failure is considered? No cosmic rays flipping bits? No soft or hard real-time guarantees are discussed? What about indeterminate operations that can fail, such as dynamically requesting memory from the operating system?

I'm asking because I thought high integrity systems are generally evaluated and certified as a combination of hardware and software. Considering software alone seems pretty useless.

Specifications that are formally verified can definitely cover real-time guarantees, behaviour under error returns from operations like allocation, and similar things. Hardware failures can be accounted for in hardware verification, which is much like software verification: specification + hardware design = verified design; if the spec covers it, the verification guarantees it.
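
As a small illustration of the allocation point (a toy Haskell sketch with made-up names, not anything from the article): once the interface makes failure explicit, ignoring the failure path stops being the path of least resistance. This is only the type-level cousin of formal verification, but the idea is the same — the failure case is part of the specification, so it can't be silently dropped.

    -- Hypothetical allocator; Nothing models "the OS refused the request".
    import Data.Word (Word8)

    newtype Buffer = Buffer [Word8]

    allocate :: Int -> Maybe Buffer
    allocate n
      | n <= 0 || n > 4096 = Nothing
      | otherwise          = Just (Buffer (replicate n 0))

    -- The "spec" here is just the type: with -Wall -Werror, a caller that
    -- forgets the Nothing case is rejected at compile time.
    useBuffer :: Int -> Either String Int
    useBuffer n =
      case allocate n of
        Nothing          -> Left "allocation failed; degrade gracefully"
        Just (Buffer bs) -> Right (length bs)

    main :: IO ()
    main = print (useBuffer 8192)   -- Left "allocation failed; degrade gracefully"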

Considering software alone isn't "pretty useless", nor is having the guarantee that "inc x = x - 1" will always go from an Int to an Int. Even if the implementation isn't "fully right", at least trying to increment a string or a complex number will be rejected at compile time. Giving up on any improvement in the correctness of code because it doesn't get you all the way to 100% correct is, IMO, defeatist.

(Giving up on it because it has diminishing returns and isn't worth the effort is reasonable, of course!)
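
To put the toy example in code (Haskell, with the same deliberately wrong body):

    inc :: Int -> Int
    inc x = x - 1             -- wrong behaviour, but still Int -> Int

    main :: IO ()
    main = print (inc 41)     -- prints 40: the type checker is satisfied,
                              -- the intent ("increment") is not

    -- These never get as far as running:
    -- inc "hello"            -- type error: a String is not an Int
    -- inc (2 :+ 3)           -- type error: a Complex number is not an Int

The guarantee is narrow, but it's real, and it costs nothing at runtime.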

  • Hardware verification doesn't prevent hardware failures. There is a reason RAM comes with ECC, and it's not because RAM designers are too lazy to do formal verification. Even with ECC RAM, errors can still slip through if multiple bits flip at the same time.

    There are also things like CPUs occasionally taking the wrong branch. You can't assume the hardware will work perfectly in the real world; you have to design for failure.

    • Well of course hardware fails, and of course verification doesn't make things work perfectly. Verification says the given design meets the specification, assumptions and all. When the assumptions don't hold, the design shouldn't be expected to work correctly, either. When the assumptions do hold, formal verification says the design will work correctly (plus or minus errors in tools and materials).

      We know dynamic RAM is susceptible to bit-flip errors. We can quantify the likelihood of it pretty well under various conditions. We can design a specification to detect and correct single-bit errors. We can design hardware to meet that specification. We can formally verify it. That's how we get ECC RAM. (A toy sketch of that single-error-correction idea follows after this thread.)

      CPUs are almost never formally verified, at least not in full. Reliability engineering around systems too complex to verify, too expensive to engineer to never fail, or that might operate outside of the safe assumptions of their verified specifications, usually means something like redundancy and majority-rules designs. That doesn't mean verification plays no part. How do you know your majority-rules design works in the face of hardware errors? Specify it, verify it.

    • Designing around hardware failure in software ranges from cumbersome to insane. If the CPU can jump to an arbitrary address and execute whatever is there, no software guarantee applies.

      What you actually do here is consider the probability of a cosmic-ray bit flip and then accept a certain failure probability. For things like train signals, that's on the order of one failure per billion hours of operation.

      6 replies →
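
To make the ECC point above concrete, here is a toy Hamming(7,4) round trip in Haskell. Real ECC DIMMs use a wider SECDED code over 64-bit words, but the specification has the same shape: any single flipped bit can be located and corrected.

    import Data.Bits (xor)

    -- Hamming(7,4): four data bits protected by three parity bits.
    encode :: [Bool] -> [Bool]
    encode [d1, d2, d3, d4] =
      let p1 = d1 `xor` d2 `xor` d4
          p2 = d1 `xor` d3 `xor` d4
          p3 = d2 `xor` d3 `xor` d4
      in  [p1, p2, d1, p3, d2, d3, d4]
    encode _ = error "encode expects exactly four data bits"

    -- Recompute the parities; the syndrome is the position of the flipped
    -- bit (0 means no error). Flip it back and return the data bits.
    decode :: [Bool] -> [Bool]
    decode bits | length bits == 7 =
      let at i        = bits !! (i - 1)
          parityOf is = foldr xor False (map at is)
          s1 = parityOf [1, 3, 5, 7]
          s2 = parityOf [2, 3, 6, 7]
          s3 = parityOf [4, 5, 6, 7]
          errPos = fromEnum s1 + 2 * fromEnum s2 + 4 * fromEnum s3
          fixed  = [ if i == errPos then not b else b
                   | (i, b) <- zip [1 ..] bits ]
      in  [fixed !! 2, fixed !! 4, fixed !! 5, fixed !! 6]
    decode _ = error "decode expects exactly seven bits"

    main :: IO ()
    main = do
      let word  = [True, False, True, True]
          errAt = 5 :: Int
          hit   = [ if i == errAt then not b else b         -- one bit flips in storage
                  | (i, b) <- zip [1 ..] (encode word) ]
      print (decode hit == word)                            -- True: the flip is corrected

Note that the guarantee is exactly as strong as the assumption: flip two bits of the same word and this decoder will cheerfully "correct" the wrong one, which is why the comment above about multi-bit upsets still stands.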

Even then you have other physical issues to consider. This is one of the things I love about the Space Shuttle. It had five computers for redundancy during launch and return. You obviously don't want to put them all in the same place, so you spread them out among the avionics bays. You also don't want them all installed in the same orientation, so you mount them differently with respect to the vehicle's orientation. The vehicle also has a huge data network that requires redundancy, so you take all the same steps with the multiplexers as well.

  • The best example on the Shuttle was the engine control computers. Each engine had two controllers, a primary and a backup, each with its own set of sensors in the engine itself and each consisting of a lock-step pair of processors. For each engine, the primary controller used processors built by one supplier, while the backup used processors of the same architecture produced by an entirely different supplier (Motorola and TRW).

    Today, even fairly standard automotive ECUs use dual-processor lock-step systems; a lot of the Cortex-R microcontrollers on the market are designed around enabling dual-core lock-step use, with error/difference checking on all of the buses and memory.
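
A toy model of the majority-vote side of this redundancy, plus the back-of-the-envelope arithmetic behind "accept a certain failure probability" (Haskell; the numbers are made up for illustration, this is not Shuttle or ECU code):

    -- 2-out-of-3 vote over redundant results; one misbehaving unit is outvoted.
    vote :: Eq a => a -> a -> a -> Maybe a
    vote a b c
      | a == b || a == c = Just a
      | b == c           = Just b
      | otherwise        = Nothing    -- no majority: fail safe rather than guess

    -- If each unit independently fails with probability p in a given hour,
    -- the vote itself is wrong only when two or three units fail together.
    tmrFailureProb :: Double -> Double
    tmrFailureProb p = 3 * p ^ (2 :: Int) * (1 - p) + p ^ (3 :: Int)

    main :: IO ()
    main = do
      print (vote 42 42 41 :: Maybe Int)   -- Just 42: the faulty unit is outvoted
      print (vote 1 2 3    :: Maybe Int)   -- Nothing: more than one fault
      print (tmrFailureProb 1.0e-5)        -- ~3.0e-10, under the 1e-9/hour target

The independence assumption is doing a lot of work there, which is exactly why the Shuttle computers were spread across bays, mounted in different orientations, and sourced from different suppliers.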

Nowhere does the article claim that:

   "formal verification of the code" -> "high integrity system"

Formal verification is simply a method of ensuring your code behaves how you intend.

Now, if you want to formally verify that your program can tolerate any number of bit flips in any variables at any moment(s) in time, the tooling will happily check that specification for you. Unfortunately, with presently known software methods, it's an unmeetable one :)

For portable libraries and apps, there's only so much you can do. However, there are some interesting properties you can prove assuming the environment behaves according to a spec.
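
For instance (a toy property in Haskell; QuickCheck only tests it rather than proving it, but it is the shape of statement you would hand to a prover): assuming the environment flips at most one bit of a stored byte, a single parity bit is enough to detect the corruption.

    import Data.Bits (complementBit, popCount)
    import Data.Word (Word8)
    import Test.QuickCheck (Property, choose, forAll, quickCheck)

    -- One parity bit per byte: detection only, no correction.
    parity :: Word8 -> Bool
    parity w = odd (popCount w)

    -- Environment assumption: exactly one bit flips. Claim: the parity changes.
    prop_singleFlipDetected :: Word8 -> Property
    prop_singleFlipDetected w =
      forAll (choose (0, 7 :: Int)) $ \i ->
        parity (complementBit w i) /= parity w

    main :: IO ()
    main = quickCheck prop_singleFlipDetected

Weaken the assumption (two flips, or a flip while the parity is being read) and the property is simply false — which is the honest answer to the hardware questions above: the proof tells you what you get, given what you assumed.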

Side channels? Is best-out-of-2 sufficient, or is best-out-of-3 necessary?

From https://westurner.github.io/hnlog/#search:awesome-safety :

awesome-safety-critical: https://awesome-safety-critical.readthedocs.io/en/latest/

Hazard (logic) https://en.wikipedia.org/wiki/Hazard_(logic)

Hazard (computer architecture); out-of-order execution and delays: https://en.wikipedia.org/wiki/Hazard_(computer_architecture)

Soft error: https://en.wikipedia.org/wiki/Soft_error

SEU: Single-Event Upset: https://en.wikipedia.org/wiki/Single-event_upset

And then there are cosmic rays and particle physics.