Comment by mkozlows

14 hours ago

It's because the models keep getting better! What you could do with GPT-4 was more impressive than what you could do with GPT 3.5. What you could do with Sonnet 3.5 was more impressive yet, and Sonnet 4, and Sonnet 4.5.

Some of these improvements have been minor, some of them have been big enough to feel like step changes. Sonnet 3.7 + Claude Code (they came out at the same time) was a big step change; Opus 4.5 similarly feels like a big step change.

(If you don't trust vibes, METR's task completion benchmark shows huge improvements, too.)

If you're sincerely trying these models out with the intention of seeing if you can make them work for you, and doing all the things you should do in those cases, then even if you're getting negative results somehow, you need to keep trying, because there will come a point where the negative turns positive for you.

If you're someone who's been using them productively for a while now, you need to keep changing how you use them, because what used to work is no longer optimal.

Models keep getting better but the argument I'm critiquing stays the same.

So does the comment I critiqued in the sibling comment to yours. I don't know why it's so hard to believe that we have tried. I have a Claude subscription. I'm an ML researcher myself. Trust me, I do try.

But that last part also makes me keenly aware of their limitations and failures. Frankly I don't trust experts who aren't critiquing their field. Leave the selling points to the marketing team. The engineer and researcher's job is to be critical. To find problems. I mean, how the hell do you solve problems if you're unable to identify them lol. Let the marketing team lead development direction instead? Sounds like a bad way to solve problems.

  > benchmark shows huge improvements

Benchmarks are often difficult to interpret. It is really problematic that they got incorporated into marketing. If you don't understand what a benchmark measures, and more importantly, what it doesn't measure, then I promise you that you're misunderstanding what those numbers mean.

For METR, I think they say a lot right here (emphasis my own) that reinforces my point:

  > Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most *exam-style problems* for a fraction of the cost. ... And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. *They are unable to reliably handle even relatively low-skill*, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in some sense, but it is unclear how this corresponds to real-world impact.

So make sure you're really careful to understand what is being measured. What improvement actually means. To understand the bounds.

It's great that they include longer tasks, but also notice the biases and distribution in the human workers. This matters for evaluating the results properly.

Also remember what exactly I quoted. For a long time we've all known that being good at leetcode doesn't make one a good engineer. But it's an easy thing to test, and the test correlates with other skills that people tend to learn on the way to getting good at those tests (even though the tests can be metric hacked). We're talking about massive compression machines that pattern match. Pattern matching tends to get much more difficult as task time increases, but this is not a necessary condition.

Treat every benchmark adversarially. If you can't figure out how to metric hack it, then you don't know what the benchmark is measuring (and just because you know what can hack it doesn't mean you understand it, nor that that's what is being measured).

  • I think you should ask yourself: If it were true that 1) these things do in fact work, 2) these things are in fact getting better... what would people be saying?

    The answer is: Exactly what we are saying. This is also why people keep suggesting that you need to try them out with a more open mind, or with different techniques: Because we know with absolute first-person iron-clad certainty what is possible, and if you don't think it's possible, you're missing something.

  • I don't understand what your argument is.

    It seems to be "people keep saying the models are good"?

    That's true. They are.

    And the reason people keep saying it is because the frontier of what they do keeps getting pushed back.

    Actual, working, useful code completion in the GPT 4 days? Amazing! It could automatically write entire functions for me!

    The ability to write whole classes and utility programs in the Claude 3.5 days? Amazing! This is like having a junior programmer!

    And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!

    But now we are beginning to see that programming in 6 months' time might look very different from now, because these AI systems code very differently from us. That's exactly the point.

    So what is it you are arguing against?

    I think you said you didn't like that people are saying the same thing, but in this post it seems more complicated?

    • > And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!

      People have been doing this parlor trick with various "substantial" programs [1] since GPT 3. And no, the models aren't better today, unless you're talking about being better at the same kinds of programs.

      [1] If I have to see one more half-baked demo of a running game or a flight sim...

    • Is there an endpoint for AI improvement? If we can go from functions to classes to substantial programs then it seems like just a few more steps to rewriting whole software products and putting a lot of existing companies out of business.

      "AI, I don't like paying for my SAP license, make me a clone with just the features I need".