Comment by Amekedl

14 days ago

So the wall really has been hit now, ouch. It was to be expected with gpt-“4.5”, but still, the realization now really feels grounded.

It's kinda hilarious to see people claiming that the wall has been hit for the past two years, while evals have been creeping up each month, particularly on realistic end-to-end ones like SWE-bench.

Have you compared GPT-4.5 to 4o?

GPT-4.5 just knows things. Some obscure programming language? It knows the syntax.

Obviously, that's not sufficient - you also need reasoning, post-training, etc. So, quite predictably, G2.5P, being a large model + reasoning + tuning, got SotA in code generation.

(FWIW I think if it were tuned for a particular input/output format it could gain another 10%)

But, yeah, the wall, the wall!

  • Ever heard about benchmark contamination?

    Ever tried to explain a new concept, like a new state management store for web frontend?

    Most fail spectacularly there. I had reasonable “success” with Sonnet 3.7, but not with 4.5 - it faltered completely.

    Let’s not get ahead of ourselves. Looking at training efficiency now, along with all the other factors, it really is difficult to paint a favorable picture atm.