Comment by nl

2 months ago

I don't understand what your argument is.

It seems to be "people keep saying the models are good"?

That's true. They are.

And the reason people keep saying it is because the frontier of what they do keeps getting pushed back.

Actual, working, useful code completion in the GPT 4 days? Amazing! It could automatically write entire functions for me!

The ability to write whole classes and utility programs in the Claude 3.5 days? Amazing! This is like having a junior programmer!

And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!

But now we are beginning to see that programming in 6 months' time might look very different to now because these AI systems code very differently to us. That's exactly the point.

So what is it you are arguing against?

I think you said you didn't like that people are saying the same thing, but in this post it seems more complicated?

18 comments

timr 2 months ago

> And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!

People have been doing this parlor trick with various "substantial" programs [1] since GPT 3. And no, the models aren't better today, unless you're talking about being better at the same kinds of programs.

[1] If I have to see one more half-baked demo of a running game or a flight sim...

simonw 2 months ago

> And no, the models aren't better today

Can you expand on that? It doesn't match my experience at all.
- timr 2 months ago
  
  It’s a vague statement that I obviously cannot defend in all interpretations, but what I mean is: the performance of models at making non-trivial applications end-to-end, today, is not practically better than it was a few years ago. They’re (probably) better at making toys or one-shotting simple stuff, and they can definitely (sometimes) crank out shitty code for bigger apps that “works”, but they’re just as terrible as ever if you actually understand what quality looks like and care to keep your code from descending into entropy.
  I think "substantial" is doing a lot of heavy lifting in the sentence I quoted. For example, I’m not going to argue that aspects of the process haven’t improved, or that Claude 4.5 isn't better than GPT 4 at coding, but I still can’t trust any of the things to work on any modestly complex codebase without close supervision, and that is what I understood the broad argument to be about. It's completely irrelevant to me if they slay the benchmarks or make killer one-shot N-body demos, and it's marginally relevant that they have better context windows or now hallucinate 10% less often (in that they're more useful as tools, which I don't dispute at all), but if you want to claim that they're suddenly super-capable robot engineers that I can throw at any "substantial" problem, you have to bring evidence, because that's a claim that defies my day-to-day experience. They're just constantly so full of shit, and that hasn't changed, at all.
  FWIW, this line of argument usually turns into a motte-and-bailey fallacy, where someone makes an outrageous claim (e.g. "models have recently gained the ability to operate independently as a senior engineer!"), and when challenged on the hyperbole, retreats to a more reasonable position ("Claude 4.5 is clearly better than GPT 3!"), but with the speculative caveat that "we don't know where things will be in N years". I'm not interested in that kind of speculation.
  

pianopatrick 2 months ago

Is there an endpoint for AI improvement? If we can go from functions to classes to substantial programs then it seems like just a few more steps to rewriting whole software products and putting a lot of existing companies out of business.

"AI, I don't like paying for my SAP license, make me a clone with just the features I need".

godelski 2 months ago

Two things seem to be in contention:

  - Models keep getting better[0]
  - Models since GPT 3 are able to replace junior developers

It's true that both of these can be true at the same time but they are still in contention. We're not seeing agents ready to replace mid-level engineers, and quite frankly I've yet to see a model actually ready to replace juniors. Possibly low-end interns, but the major utility of interns is to trial run employment. Frankly it still seems like interns and juniors are advancing faster than these models in the type of skills that matter for companies (not to mention that institutional knowledge is quite valuable). But there are interns who started when GPT 3.5 came out that are seniors now.

The problem is we've been promised that these employees would be replaced[1] any day now, yet that's not happening.

People forget, it is harder to advance when you're already skilled. It's not hard to go from non-programmer to a junior level. Hard to go from junior to senior. And even harder to advance to staff. The difficulty level only increases. This is true for most skills and this is where there's a lot of naivety. We can be advancing faster while the actual capabilities begin to crawl forward rather than leap.

[0] Implication is not just at coding test style questions but also in more general coding development.

[1] Which has another problem in the pipeline. If you don't have junior devs and are unable to replace both mid and seniors by the time that a junior would advance to a senior then you have built a bubble. There's a lot of big bets being made that this will happen yet the evidence is not pointing that way.