Comment by Amekedl

14 days ago

So the wall really has been hit now, ouch. It was to be expected with gpt-“4.5”, but still, the realization now really feels grounded.

It's kinda hilarious to see people claiming that the wall has been hit for the past two years, while evals have been creeping up each month, particularly on realistic end-to-end ones like SWE-bench.

Have you compared GPT-4.5 to 4o?

GPT-4.5 just knows things. Some obscure programming language? It knows the syntax.

Obviously, that's not sufficient - you also need reasoning, post-training, etc. So, quite predictably, G2.5P, being a large model + reasoning + tuning, got SotA in code generation.

(FWIW I think if it were tuned for a particular input/output format it could gain another 10%)

But, yeah, the wall, the wall!

  • Ever heard about benchmark contamination?

    Ever tried to explain a new concept, like a new state management store for web frontend?

    Most fail spectacularly there. I had reasonable “success” with Sonnet 3.7, but not with 4.5 - it faltered completely.

    Let’s not get ahead of ourselves. Looking at training efficiency now, along with all the other factors, it really is difficult to paint a favorable picture atm.