Three ways formally verified code can go wrong in practice

1 day ago (buttondown.com)

Formal verification is ultimately an epistemological problem: how do we know that our model of reality corresponds to reality?

People have commented that validation is different from verification, but I think validation can be done by formally specifying the environment the program runs in.

For example, the question of whether a user can do X in your software corresponds to the question: is there a sequence of user inputs such that the software plus the operating system produces the output X?

By including the environment (like the operating system) in the formal spec, we could in theory answer such questions.

Even if the question is fuzzy, we can today formalize it with AI models. For example, if we ask "does the paint program allow the user to draw a rose?", we can take a definition of a rose from a neural network that recognizes what a rose is, and formally verify the combined system of the paint program and the rose verifier given by the NN.
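
To make the shape of that concrete, here is a toy sketch in Haskell, with all names as illustrative stand-ins (not the article's): the fuzzy question becomes an existential property over the composed system, which a prover would then have to establish.

    -- Toy sketch: "can the user draw a rose?" as an existential property
    -- over the composed system. runPaint stands in for a formal model of
    -- the paint program plus its environment; isRose stands in for the
    -- predicate extracted from the rose-recognizing NN.
    type Image = [[Bool]]

    runPaint :: [String] -> Image
    runPaint cmds = [[cmd == "petal" | cmd <- cmds]]

    isRose :: Image -> Bool
    isRose img = or (concat img)

    -- The claim to verify: some input sequence yields a rose. A prover
    -- would establish this existential; here it is a bounded search.
    canDrawRose :: [[String]] -> Bool
    canDrawRose = any (isRose . runPaint)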

  • Now you’ve just shifted the problem to a much harder statement: proving that the things the NN recognizes as roses actually are roses.

    You cannot get rid of the environment, and the best you can do is be explicit about what you rely on. For example, formally verified operating systems often rely on hardware correctness, even though the errata sheets for modern CPUs are hundreds of pages long.

    • Yeah. That's why I am saying it's the epistemological problem.

      In the above example, that the NN recognizes a rose as a rose is an assumption that the NN is correct as a model of (part of) the world. On that assumption, we get a formal definition of "an image of a rose in the pixelated image", and we use that to formally prove that our system allows roses to be painted.

      But getting rid of that assumption is, I believe, epistemologically impossible; you have to assume some other correspondence with reality.

      I agree with you on the hardware. I think the biggest obstacle to formal software correctness is the lack of formal models that we can be confident describe our environment. (One obstacle is IP law: companies do not like to share models of the things they produce.)

    • Right, the problem is that one can't formally describe the environment, and any one thing has to interact with it. So the formality goes right out the window.

I was told to use these words back in uni:

verification - Are we building the software right?

validation - Are we building the right software?

Makes many a thing easier to talk about at work.

  • Just from reading the headline I immediately suspected the topic of Verification vs Validation would be involved. I still cannot comprehend why there is such a gap in many software projects I have worked on. Everyone knows about testing, but only a few understand, or want to understand, why validation is equally important.

  • Yep... Along with this:

    Verification aligns with a specification, e.g. verify that what was implemented matches what was specified.

    Validation aligns with a requirement, e.g. verify that what was implemented solves the actual problem.

    • I like to talk about goals and not requirements.

      What is your product goal? Both for the customer (solving a problem) and for the company (filling a marketing need, making money)?

      Can you prove that you are meeting both? Then you are validating your goals successfully.

      Most of the time products are missing defined goals, so lots of people have no idea what the fuck they are developing, and everyone ends up with their own idea of the product.

  • While this is a good general rule of thumb, this terminology has nothing to do with how these terms are formally defined.

    Verification first and foremost concerns the design and development stage; a formal definition would be something like "the outputs of a design and development step meet the specified requirements for that step". For example, you could verify that a formal specification for a module actually implements the imposed requirements. Especially in software this can get murky, as you cover different phases with different tests/suites; e.g. unit, integration, and e2e tests implicitly test different stages, yet are often part of the same testing run.

    Validation first and foremost concerns the whole system/product and its fitness for market availability.

    For example, for a motor vehicle you would verify that the ABS functions correctly, the airbags deploy in a crash, and the frame meets passenger safety requirements in a crash, and you would still not be able to sell such a vehicle. You would validate the vehicle as a whole with the corresponding regulatory body to get the vehicle deemed fit for market.

    TL;DR: Verification is getting a passing grade in a lab test; validation is submitting all the relevant lab test protocols to get CE certified.

  • There is an important and distinct pair of definitions used by a possibly smaller but significant number of people:

    Verification: formal theoretical proof

    Validation: empirical, test-based approach

  • If it's hard to remember which is which, maybe they should be different words.

    Like "Customer validation" and "Verification against the spec"

    Like "sympathy" and "empathy". I finally heard a mnemonic but if I need a mnemonic maybe it's a sign that the words are just too hard to understand

    • I just don't bother to shorten it...

      - Did we correctly understand and build what the customer asked for?

      - Did the customer actually ask for what would solve the problem they are trying to solve?

      The second one is very important to ask early on, and _not_ asking it can cause a lot of wasted time. When I'm in a position to comment on whether I think someone fits the requirements of a senior developer, whether or not they ask that question plays directly into it.

    • Etymology to the rescue. "Valid" and "value" share a root, and so do "verify" and "verity." You validate something to see if it has value, meaning it works and is fit for purpose. You verify it to check that it is true and meets its specifications. Does that help? :)

This was very nice, well-written and good-looking (to me) content. Thanks!

There was a meta-bug with the encoding which almost reads like an internal joke, but I don't think it's intentional. The Haskell code examples have:

    inc :: Int -&gt; Int

several times, and that `&gt;` entity should very likely just be a `>` in the source code.

Many interesting statements aren't a property of the code alone; they're a property of the code when it's run in a particular environment. If you want the proof to be portable, then it should only make assumptions that are true in any environment.

Obligatory quote:

"Beware of bugs in the above code; I have only proved it correct, not tried it." - Donald Knuth

Source: https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf

From a quick look at it, it seems to be a "case 1" potential error, as the proof is done by hand. That is, the warning is likely there because it is a mathematical exercise first, and the code is for illustration, not for inclusion in production software.

> properties like all x in Int: inc(x) > x.

Why not just use an assertion? It costs more at runtime, but it's far less dev time than developing a proof, right?

  • It's not a substitute: an assertion only makes sure that a violation doesn't happen unnoticed; a proof says that it doesn't happen at all. I might prefer to know ahead of time that my reactor control system has no overflow, rather than hitting an assertion and crashing when it does (though it would be a bad reactor control system if it couldn't handle crashes).
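
    As a minimal sketch of the contrast, using GHC's Control.Exception.assert (the guard here is illustrative):

        import Control.Exception (assert)

        -- The assertion only detects the overflow at runtime and crashes;
        -- a proof would rule it out before the program ever runs. (Note
        -- that GHC drops asserts under -O unless -fno-ignore-asserts.)
        inc :: Int -> Int
        inc x = assert (x < maxBound) (x + 1)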

  • Nearly everything that goes wrong for me in prod is already a crash/stack trace, which is what an assertion gives you.

    The point of proofs is to not get yourself (or the customer) to that point in the first place.

  • I think other responses here may miss an important line of thought.

    There seems to be confusion around the difference between "has a bug" and "is correct".

    In the referenced study on binary searches, the Google researchers stated that most implementations had bugs AND were "broken" because they did:

        int mid = (low + high) / 2;
    

    And low + high can overflow.

    Ostensibly, a proof may require mid to be positive at all times, and the implementation would therefore fail the proof because of the overflow.

    The question is: if we put in assertions everywhere that guarantee correct output, where aborting on overflow is considered correct, then are folks going to say that's correct, or are they going to say "oh, that's not what binary search is, so you haven't implemented it"?

    EDIT: thinking about it more, it seems that null must be an accepted output of any program due to the impossibility of representing very large numbers.
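
    (For reference, the standard repair computes the midpoint without ever forming the sum; a sketch here in the article's Haskell, mirroring the usual `low + (high - low) / 2` fix:)

        -- Midpoint without computing low + high, so it cannot overflow
        -- whenever 0 <= low <= high.
        mid :: Int -> Int -> Int
        mid low high = low + (high - low) `div` 2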

  • Correct. You wouldn't prove everything in a typical production system unless that system is controlling a rocket or a pacemaker.

    You may want to prove some things. Test others.

    Types are proofs, by the way. Most programmers find that handy.

    I think assertions are ridiculously underused, BTW.

  • If you're flying a plane, you probably don't want safety-critical code to fail an assertion. It's for this reason that Airbus spends a bunch of money and time on proving their flight control software.

  • Better to use a type which cannot represent an invalid state.

    For example, instead of defining `inc` with `Int`s, the equivalent of an `IncrementalInt` type parameter for the `inc` function will ensure correctness without requiring runtime assertions.
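
    A minimal sketch of that idea (`IncrementalInt` is the hypothetical type from above; the smart constructor is an assumption about how the invariant would be maintained):

        -- Invariant: the wrapped value is strictly below maxBound,
        -- so incrementing can never overflow.
        newtype IncrementalInt = IncrementalInt Int

        mkIncrementalInt :: Int -> Maybe IncrementalInt
        mkIncrementalInt n
          | n < maxBound = Just (IncrementalInt n)
          | otherwise    = Nothing

        inc :: IncrementalInt -> Int
        inc (IncrementalInt n) = n + 1  -- safe by construction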

Is asserting the assumptions during code execution not standard practice for formally verified code?

  • By "asserting X" do you mean "checking whether X is true and crashing the program if not", like the assert macro in C or the assert statement in Python? No, that is almost never done, for three reasons:

    • Crashing the program is often what you formally verified the program to prevent in the first place! A crashing program is what destroyed Ariane 5 on its maiden flight, for example. Crashing the program is often the worst possible outcome rather than an acceptable one.

    • Many of the assumptions are not things that a program can check are true. Examples from the post include "nothing is concurrently modifying [variables]", "the compiler worked correctly, the hardware isn't faulty, and the OS doesn't mess with things," and, "unsafe [Rust] code does not have [a memory bug] either." None of these assumptions could be reliably verified by any conceivable test a program could make.

    • Even when the program could check an assumption, it often isn't computationally feasible; for example, binary search of an array is only valid if the array is sorted, but checking that every time the binary search routine is invoked would take it from logarithmic time to linear time, typically an orders-of-magnitude slowdown that would defeat the purpose of using a binary search instead of a simpler sequential search. (I think Hillel tried to use this example in the article but accidentally wrote "binary sort" instead, which isn't a thing.)

    When crashing the program is acceptable and correctness preconditions can be efficiently checked, postconditions usually can be too. In those cases, it's common to use either runtime checks or property-based testing instead of formal verification, which is harder.
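
    For instance, a minimal property-based sketch with QuickCheck (illustrative; not code from the post):

        import Test.QuickCheck

        inc :: Int -> Int
        inc x = x + 1

        -- Empirically check the claimed postcondition over generated Ints
        -- instead of proving it.
        prop_incIncreases :: Int -> Bool
        prop_incIncreases x = inc x > x

        main :: IO ()
        main = quickCheck prop_incIncreases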

    • This becomes an interesting conversation then. First of all, it could mean "checking whether X is true and logging an error" instead of exiting the program.

      - But if you aren't comfortable crashing the program if the assumptions are violated, then what is your formal verification worth? Not much, because the formal verification only holds if the assumptions hold, and you are indicating you don't believe they will hold.

      - True, some are infeasible to check. In that case, you could check them weakly or indirectly: for example, check whether the first two elements of the input array are out of order. You could also check them infrequently. Better to partially check your assumptions than not check at all.
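
        For instance, a sketch of such a weak, constant-time spot check (hypothetical, in Haskell):

            -- Catches some unsorted inputs at O(1) cost: a True result
            -- refutes the sortedness assumption; False proves nothing.
            clearlyUnsorted :: Ord a => [a] -> Bool
            clearlyUnsorted (x:y:_) = x > y
            clearlyUnsorted _       = False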

  • That’s impractical. Take binary search and the assumption that the list is sorted. Verifying that the list is sorted would negate the point of binary search, as you would be inspecting every item in the list.

    • There is a useful middle ground here. When picking the middle element, verify that it is indeed within the established bounds. This way, you'll still catch the sort order violations that matter without making the whole search linear.
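
      Roughly, such a bounded-probe check might look like this (a sketch assuming the vector package; not code from the thread):

          import qualified Data.Vector as V
          import Control.Exception (assert)

          -- Binary search that asserts each probed element lies between the
          -- values at the current bounds: an O(1) partial sortedness check
          -- per probe, instead of an O(n) full check up front.
          bsearch :: V.Vector Int -> Int -> Maybe Int
          bsearch xs target = go 0 (V.length xs - 1)
            where
              go low high
                | low > high = Nothing
                | otherwise  =
                    let m = low + (high - low) `div` 2
                        v = xs V.! m
                    in assert (xs V.! low <= v && v <= xs V.! high) $
                       case compare v target of
                         EQ -> Just m
                         LT -> go (m + 1) high
                         GT -> go low (m - 1)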

    • ASSERTING the list is sorted as an assumption is significantly different FROM VERIFYING that the list is sorted before executing the search. Moreover, type systems can track that a list was previously sorted and has maintained its sorted status, making the assumption reasonable to state.

    • Only if you verify it for every search. If you haven't touched the list since the last search, the verification is still good. For some (not all) situations, you can verify the list at the start of the program, and never have to verify it again.

  • What would that look like if you accidentally assumed you have arbitrarily large integers, but in practice you have 64 bits?

Formal verification is great, but cosmic rays don't read our specs.

  • They do if they’re in the specs. See the near-universal use of lockstep processors, ECC, etc. in safety-critical and high-radiation cases, which comes from the two requirements that “a single bit flip from a cosmic ray shall be detectable” and “multiple simultaneous bit flips from cosmic rays shall be shown to be sufficiently rare in the operating environment.”

No hardware failure is considered? No cosmic rays flipping bits? No soft or hard real-time guarantees are discussed? What about indeterminate operations that can fail, such as requesting memory from some operating system dynamically?

I'm asking because I thought high integrity systems are generally evaluated and certified as a combination of hardware and software. Considering software alone seems pretty useless.

  • Specifications that are formally verified can definitely cover real-time guarantees, behaviour under error returns from operations like allocation, and similar things. Hardware failures can be accounted for in hardware verification, which is much like software verification: specification + hardware design = verified design; if the spec covers it, the verification guarantees it.

    Considering software alone isn't pretty useless, nor is having the guarantee that "inc x = x - 1" will always go from an Int to an Int; even if it's not "fully right", at least trying to increment a string or a complex number will be rejected at compile time. Giving up on any improvement in the correctness of code because it doesn't get you all the way to 100% correct is, IMO, defeatist.

    (Giving up on it because it has diminishing returns and isn't worth the effort is reasonable, of course!)

    • Hardware verification doesn't prevent hardware failures. There is a reason RAM comes with ECC. It's not because RAM designers are too lazy to do formal verification. Even with ECC RAM, bit flips can still happen if multiple bits flip at the same time.

      There are also things like CPUs taking the wrong branch that occasionally happen. You can't assume that the hardware will work perfectly in the real world and have to design for failure.

  • Nowhere does the article claim that:

       "formal verification of the code" -> "high integrity system"
    

    Formal verification is simply a method of ensuring your code behaves how you intend.

  • Now, if you want to formally verify that your program can tolerate any number of bit flips on any variables at any moment(s) in time, it will happily test this for you. Unfortunately, assuming presently known software methods, this is an unmeetable specification :)

  • Even then you have other physical issues to consider. This is one of the things I love about the Space Shuttle. It had 5 computers for redundancy during launch and return. You obviously don't want to put them all in the same place, so you spread them out among the avionics bays. You also obviously don't want them all installed in the same orientation, so you install them differently with respect to the vehicle's orientation. You also have a huge data network that requires redundancy, and you take all the same steps with the multiplexers as well.

    • The best example on the Shuttle was the engine control computers. Each engine had two controllers, primary and backup, each with its own set of sensors in the engine itself and each consisting of a lock-step pair of processors. For each engine, the primary controller would use processors built by one supplier, while the backup would use processors of the same architecture but produced by an entirely different supplier (Motorola and TRW).

      Today, even fairly standard automotive ECUs use dual-processor lock-step systems; a lot of the Cortex-R microcontrollers on the market are designed around enabling dual-core lock-step use, with error/difference checking on all of the buses and memory.

  • For portable libraries and apps, there's only so much you can do. However, there are some interesting properties you can prove assuming the environment behaves according to a spec.

  • Side channels? Is best out of 2 sufficient or is best out of 3 necessary?

    From https://westurner.github.io/hnlog/#search:awesome-safety :

    awesome-safety-critical: https://awesome-safety-critical.readthedocs.io/en/latest/

    Hazard (logic) https://en.wikipedia.org/wiki/Hazard_(logic)

    Hazard (computer architecture); out-of-order execution and delays: https://en.wikipedia.org/wiki/Hazard_(computer_architecture)

    Soft error: https://en.wikipedia.org/wiki/Soft_error

    SEU: Single-Event Upset: https://en.wikipedia.org/wiki/Single-event_upset

    And then cosmic ray and particle physics

This is incredible. This post led me to your GitHub, which is full of similarly incredible content. “Awesome Cold Showers”? Beautiful.