Comment by CuriouslyC

5 days ago

I dislike Anthropic but I wouldn't argue 4.8 isn't an improvement on 4.5/4.6. Your tasks just might not typically need the extra intelligence.

Opus 4.7/4.8 often over-engineers on my setups, plus:

- It talks a LOT more like GPT models. You know: wrinkle, shape, gate, coarse, scope, gap, path, production-ready-workflow-of-the-day, and so on -- "that's expected, a consequence of the previous like-driven workflow". If I wanted to get a headache using AI I would have gone with GPT in the first place!

- It outputs text in a way that is much harder to follow. I can't say exactly what it is. Maybe a bit of everything? Bolds are missing, bullet points are gone, paragraphs are bland and too long, and it doesn't feel like a model programming with me, but rather like a somewhat full-of-himself grandpa developer looking down on me. It's very weird to describe, but it is definitely how I feel.

Granted, this could well be because of the way it reacts to prompts now. We've got a rather large corpus of skills and "rules and good practices" that Opus 4.6 responded to really well, and maybe the new models just turn into this when fed with them... I don't know.

Either way, with Opus 4.6 being as good as it is, I need Fable to be a significant step up to justify a price increase. If it means I have to babysit Opus a little bit less on some stuff, it might be worth it. Otherwise, I'm very happy with Opus 4.6 and hope they don't deprecate it.

I'd argue that 4.8 is a straight downgrade for every type of task I've tried. It's been a gamble at this point. If 4.6 stops being available, I'm out.

Reading so many contrary positions about which model is better or worse shows how difficult it is to measure intelligence based on personal experiences. Of course, benchmarks try to make the process as objective as possible, but they often don't correlate with our personal experiences.

The other day 4.6 was fantastic for task X. Today, 4.6 overengineered everything and I had to revert all my changes. When evaluating models, perhaps it makes sense to consider luck as an ingredient before reaching any personal conclusion.

IME Opus 4.8 (and 4.7) is often a downgrade from 4.6. I find that it tends to overthink and overcomplicate things.

  • Yes, but there’s a reason we don’t evaluate these models this way and instead do it as carefully and thoughtfully as we can at scale. Human evaluations are important, but they are an absolute minefield of footguns. 4.8 is not a downgrade from 4.6; there is an insane amount of hard data that contradicts that claim.

    • Actually, the anecdata I gather at my job from myself and coworkers is the only benchmark I trust anymore, because it diverges so heavily from the “benchmarks”.

    • "Carefully and thoughtfully" is antithetical to the approach to benchmarks these days.

      Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.

    • There is no data that I would trust that contradicts it.

      Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).

      Claude was heavily lobotomised for my work starting sometime in February.

      I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)

      I cancelled my subscription in March and talked to said friends who are still on a plan just last week: they are still not happy, but the company pays, so whatever...

    • Seems like a bunch of noise. What does this even mean?

      It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"

  • "Fable 5" is Opus 4.7, and the Opus 4.7 we got is a Sonnet-sized model on a stronger base.

    That's where all the regressions and inconsistency in experiences stem from: RL can still only go so far compared with having more parameters.

Lol. If you're doing anything non-trivial that's not a CRUD webapp, e.g. some physics simulation or high-performance GPU code, any and all models I've tried suck.

They are not just leagues behind what experts would code, they are not even playing the same game.

Which is to be expected, as there isn't nearly as much physics or high-performance GPU code available as there is for your typical CRUD API and JS frontend.

  • I can attest to this. I asked Claude to do a basic 90-degree rotation on a very simple 20-line shader, and it got it completely wrong. It frequently adds pointless abstractions / intermediate variables even when I explicitly tell it not to in the system prompt. I could go on and on; these things just don't understand architecture. And why would they? They were trained on text.

    There is something remarkable about turning speech into code (I don't need to hunch over a keyboard nearly as much these days, I can just talk into a mic) and it's good for first drafts / exploring ideas. But it's obvious to anyone paying attention that we're hitting the top of the S-curve. It's no wonder the IPOs are around the corner. I mean, even Dario admitted he doesn't know how they're gonna substantially increase the context window size. That says a lot.

    • That being said, I think the harnesses are only getting better. And maybe we will get multi-modal models that understand architecture eventually. But the growing-the-blob-of-text training method that's being used now appears to be hitting diminishing returns.