Comment by simon84

1 day ago

Nice I get it now. And from your usage, a natural question is how much effort was put by the AI to find the verified answer ? Did it need to run significantly longer or less that without these constraints ?

Yes, it runs significantly longer.

Opus 4.8 ran autonomously for about 8 hours to provide program and proof of correctness, given the formal specification. And in previous experiments, Opus 4.7 failed and I was only able to do it using that model when I cut the work for the agents in smaller steps.

Much of the work the agent is to provide the proof of correctness. The upside is that less time is needed for human review and we can guarantee the absence of bugs that might be expensive when they come up in production.

Historically formal verification was only worth it for very critical software. In the readme I reference related work from NASA, that implemented and verified a different algorithm concerned with polygons, with the intended application to compute keep out zones for autonomous vehicles. This was 2021 before capable LLMs and in the paper they mention that they manually wrote 700 lemmas to produce such a formal proof manually. I hope that as it gets cheaper now, formal verification is used more widely.