Comment by esjeon
6 days ago
> For Problem 5, models often identified the correct strategies but failed to prove them, which is, ironically, the easier part for an IMO participant. This contrast ... suggests that models could improve significantly in the near future if these relatively minor logical issues are addressed.
Interesting, but I'm not sure this is really due to "minor logical issues". It sounds more like a failure caused by a lack of actual understanding (the world-model problem). The models' actual answers might offer some hints, but I can't find them.
(EDIT: oops, found the output on the main page of their website. Didn't expect that.)
> Best-of-n is Important ... the models are surprisingly effective at identifying the relative quality of their own outputs during the best-of-n selection process and are able to look past coherence to check for accuracy.
Yes, it's always easier to be a backseat driver.
> Yes, it's always easier to be a backseat driver.
Any model that can identify the correct answer reliably can arrive at the correct answer given enough time and stochasticity.
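The generate-and-verify argument above can be sketched in a few lines of Python (a toy illustration only; `generate` and `verify` here are hypothetical stand-ins, not anything from the paper):

```python
import random

def best_of_n(generate, verify, n):
    """Sample n candidate answers and keep the one the verifier scores highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins: the "model" guesses integers, and the "verifier"
# rewards closeness to a hidden answer of 42.
guess = lambda: random.randint(0, 100)
score = lambda x: -abs(x - 42)

# With enough stochastic samples and a reliable verifier, selection
# homes in on the hidden answer far more often than any single guess does.
print(best_of_n(guess, score, 1000))
```

The catch, of course, is the assumption that `verify` is reliable, which is exactly what the reply below disputes.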
Sure, and monkeys could write the works of Shakespeare given enough time.
But in this case, it is genuinely hard to know whether a model is identifying "correct answers" reliably. Many answers are hard to judge as correct or incorrect even when written by humans, and harder still when written by a machine whose output is optimized to look convincing. It can be done, but I doubt LLMs are being trained to spot the subtle differences between those kinds of candidate answers.
NP