Comment by evenhash
2 days ago
Generally when people talk about using LLMs to do mathematics research they’re not talking about the LLM alone, but the LLM + a harness for it to write and execute theorem provers such as Lean or Coq to validate their results.
I guess I just don’t have the experience or optimism that a harness around an LLM, which can’t make the first, bare deduction on its own, is a good use of compute.
I got out of RLHF, including games and puzzles, before agents took off and maybe I have outdated info. But we estimated RLHF’ing a single hard full sized sudoku was ~25 hours worth of work.