Aristotle's output is formally verified in Lean, so you can run it for days on a hard problem and be assured that the answer, no matter how complex, is right without needing to manually check it.
Claude Code can write Lean, but we do a heck of a lot of RL on theorem proving, so Aristotle winds up being much better at writing Lean than other coding agents are.
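For anyone unfamiliar with what "formally verified in Lean" buys you, here is a minimal Lean 4 sketch (the theorem name `my_add_comm` is hypothetical, not actual Aristotle output): if the kernel accepts the proof term, the statement is true, so the only thing a human still needs to read is the statement itself, not the proof.

    -- If this file compiles, the Lean kernel has checked the proof,
    -- so the statement is guaranteed true. Only the statement needs
    -- human review; the proof, however long, does not.
    theorem my_add_comm (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

That guarantee is what lets you leave it running for days and trust whatever proof comes out the other end.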
Seeing a task-specific model stay consistently better at anything is extremely surprising, given the rapid pace of innovation in foundation models.
Have you tried Aristotle on other, non-Lean tasks? Is it better at logical reasoning in general?
Is it, though? There's a reason GPT has Codex variants. RL on a specific task raises performance on that task.
How strong is your internal informal LLM at theorem proving before the formalization stage? Or is it combined in a way that makes that not measurable?