Comment by pron

10 hours ago

A fundamental problem is that program verification is intractable in the computational complexity sense. The question of whether a program satisfies a spec or not is called in the literature the model checking problem (not to be confused with model checkers, which are various algorithms for automatically solving some model checking problems). In the worst case, there is no faster way to determine whether a program satisfies a spec or not than explicitly testing each of its reachable states.

The question is, of course, what portion of programs are much easier than the worst case (to the point of making verification tractable). That is not known, but the results are not necessarily encouraging. Programs that have been deductively verified proof assistants were not only very small, and not only written with proofs in mind, but were also restricted to use simple and less efficient algorithms to make the proofs doable. They tend to require between 10x and 1000x lines of proof per line of code.

(an old blog post of mine links to some important results: https://pron.github.io/posts/correctness-and-complexity)

There is a belief that programs that people write and more-or-less work should be tractably provable, as that's what allowed writing them in the first place (i.e. the authors must have had some vague proof of correctness in mind when writing them). I don't think this argument is true (https://pron.github.io/posts/people-dont-write-programs), and we use formal methods precisely when we want to close the gap between working more-or-less and definitely always working.

But even if verifying some program is tractable, it could still take a long time between iterations. Because it's reasonable that it would take an LLM no less than a month to prove a correct program, there's no point in stopping it before. So a "practical" approach could be: write a program, try proving it for a month, and if you haven't finished in a month, try again. That, of course, could mean waiting, say, six months before deciding whether or not it's likely to ultimately succeed. Nevertheless, I expect there will be cases where writing the program and the proof would have taken a team of humans 800 years, and an LLM could do it in 80.