Comment by zamalek

5 days ago

Other comments indicate that asking it to do long multiplication does work, but the varying results make sense: LLMs are probabilistic, so you probably rolled an unlikely result.
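As a toy sketch of the "rolled an unlikely result" point (the distribution here is made up, not from any real model): an LLM samples each token from a probability distribution, so even when almost all of the probability mass sits on the correct digit, a small fraction of runs will emit a wrong one.

```python
import random

random.seed(0)

# Hypothetical next-token distribution for one digit of a product:
# the model puts 95% of its mass on the correct digit ("7") and
# spreads the remainder over a few wrong digits.
dist = {"7": 0.95, "1": 0.02, "3": 0.02, "9": 0.01}

def sample(d):
    """Draw one token from the categorical distribution d."""
    r = random.random()
    acc = 0.0
    for tok, p in d.items():
        acc += p
        if r < acc:
            return tok
    return tok  # fall through on floating-point edge cases

samples = [sample(dist) for _ in range(1000)]
wrong = sum(1 for s in samples if s != "7")
print(f"wrong digits in 1000 draws: {wrong}")
```

Even at 95% per-digit confidence, a long multiplication has many digits, so the chance that at least one of them goes wrong on a given run is substantial.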

Specifically, you need to use a reasoning model. Applying more test-time compute is analogous to Kahneman's System 2 thinking, while taking an LLM's first output directly is analogous to System 1.
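One simple form of extra test-time compute is sampling several answers and majority-voting over them (self-consistency). This is only a stand-in for what reasoning models actually do, and the "answer generator" below is a toy with made-up numbers, but it shows why spending more compute per question reduces the chance of an unlucky roll:

```python
import random

random.seed(1)

def one_shot():
    """Toy single-draw answer: right 80% of the time, else a random guess."""
    return 42 if random.random() < 0.8 else random.randint(0, 99)

def majority_vote(k):
    """Spend k times the compute: sample k answers, return the most common."""
    answers = [one_shot() for _ in range(k)]
    return max(set(answers), key=answers.count)

trials = 1000
single_acc = sum(one_shot() == 42 for _ in range(trials)) / trials
voted_acc = sum(majority_vote(9) == 42 for _ in range(trials)) / trials
print(f"single draw: {single_acc:.2f}, 9-way vote: {voted_acc:.2f}")
```

The voted accuracy lands well above the single-draw accuracy because independent wrong answers rarely agree with each other, while correct answers all coincide.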

The same holds for solving difficult novel problems, with the addition of tools that an agent can use to research the problem autonomously.