Comment by lherron

9 months ago

Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal evidence for some time to come.

I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.

I am hopeful that the coming waves of vertical integration, guardrails, and grounding applications will move us away from having to hop between models every few weeks.

Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claims of silver/gold-level performance on the IMO. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and train the model on them.