Comment by killerstorm
14 days ago
It's kinda hilarious to see people claiming the wall has been hit for the past two years while evals creep up each month, particularly on realistic end-to-end ones like SWE-bench.
Have you compared GPT-4.5 to 4o?
GPT-4.5 just knows things. Some obscure programming language? It knows the syntax.
Obviously, that's not sufficient - you also need reasoning, post-training, etc. - so quite predictably G2.5P, being a large model + reasoning + tuning, got SotA in code generation.
(FWIW I think if it was tuned for a particular input/output format it could get another 10%)
But, yeah, the wall, the wall!
Ever heard about benchmark contamination?
Ever tried to explain a new concept, like a new state management store for web frontend?
Most fail spectacularly there. I had reasonable "success" with Sonnet 3.7, but not with 4.5 - it faltered completely.
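To make that concrete, here's the kind of thing I mean - a tiny, made-up store API (createStore, subscribe, and the rest are invented for illustration, not from any real library). The test is whether a model can pick up an API like this from the spec alone, rather than from training data:

```typescript
// Made-up minimal store API, invented for illustration only -
// the point is precisely that nothing like it appears in training data.
type Listener<T> = (state: T) => void;

interface Store<T> {
  get(): T;
  set(next: T): void;
  subscribe(fn: Listener<T>): () => void; // returns an unsubscribe function
}

function createStore<T>(initial: T): Store<T> {
  let state = initial;
  const listeners = new Set<Listener<T>>();
  return {
    get: () => state,
    set(next: T) {
      state = next;
      listeners.forEach((fn) => fn(state)); // notify subscribers on every write
    },
    subscribe(fn: Listener<T>) {
      listeners.add(fn);
      return () => {
        listeners.delete(fn);
      };
    },
  };
}

// Usage - the kind of snippet you'd then ask the model to extend:
const counter = createStore(0);
const unsubscribe = counter.subscribe((n) => console.log("count:", n));
counter.set(counter.get() + 1); // logs "count: 1"
unsubscribe();
```

Hand a model a short spec like this plus a follow-up task (say, add derived/computed values), and you quickly see which ones actually generalize versus pattern-match to Redux or Zustand.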
Let's not get ahead of ourselves. Looking at training efficiency now, and all the other factors, it really is difficult to paint a favorable picture atm.
You sound like Gary Marcus.
Didn't know of him, but he seems overly skeptical. Honestly, I was just expecting more from llama-4 than this, hence mentioning the wall. I hope it's still too early to tell, because new ideas are inevitably going to change things - maybe Anthropic opens up more, or the Chinese labs keep overdelivering...