
Comment by prodigycorp

1 day ago

You’re not wrong but I still think that the harness matters a lot when trying to accurately describe Claude Code.

Here’s a reframing:

If you asked people, "What would you rather work with: today's Claude Code harness with Sonnet 3.7, or the 200-line agentic loop in the article with Opus 4.5?"

I suspect many people would choose 3.7 with the harness. Moreover, if that is true, then I'd say the article is no longer useful for a modern understanding of Claude Code.

I don't think so; model improvements far outweigh any harness or tooling.

Look at https://github.com/SWE-agent/mini-swe-agent for proof

  • Yes but people aren’t choosing CC because they are necessarily performance maximalists. They choose it because it has features that make it behave much more nicely as a pair programming assistant than mini-swe-agent.

There’s a reason Cursor poached Boris Cherny and Cat Wu and Anthropic hired them back!

Any person who would choose 3.7 with a fancy harness has a very poor memory of how dramatically model capabilities have improved between then and now.

  • I’d be very interested in the performance of 3.7 decked out with web search, context7, a full suite of skills, and code quality hooks against opus 4.5 with none of those. I suspect it’s closer than you think!

    • Skills don't make any difference beyond having markdown files to point an agent to, with instructions as needed. Context7 isn't any better than telling your agent to use trafilatura to scrape web docs for your libs, and having a linting/static analysis suite isn't a harness thing.

      3.7 was kinda dumb, it was good at vibe UIs but really bad at a lot of things and it would lie and hack rewards a LOT. The difference with Opus 4.5 is that when you go off the Claude happy path, it holds together pretty well. With Sonnet (particularly <=4) if you went off the happy path things got bad in a hurry.


    • > I suspect it’s closer than you think!

      It's not.

      I've done this (although not with all these tools).

      For a reasonably sized project it's easy to tell the difference in quality between, say, Grok-4.1-Fast (30 on AA Coding Index) and Sonnet 4.5 (37 on AA).

      Sonnet 3.7 scores 27. No way I'm touching that.

      Opus 4.5 scores 46 and it's easy to see that difference. Give the models something with high cyclomatic complexity or complex dependency chains and Grok-4.1-Fast falls to bits, while Opus 4.5 solves things.

This is SO wrong.

I actually wrote my own simple agent (with some twists) in part so I could compare models.

Opus 4.5 is in a completely different league to Sonnet 4.5, and 3.7 isn't even on the same planet.

I happily use my agent with Opus but there is no world in which I'd use a Sonnet 3.7 level model for anything beyond simple code completion.
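For readers who haven't seen one, the kind of "simple agent" loop this thread keeps referring to really is small. Below is a minimal sketch, not this commenter's code or any particular project's implementation: `call_model` is a stub standing in for a real provider API call, and `run_shell` is the only tool.

```python
# Minimal agentic loop sketch. `call_model` is a hypothetical stand-in
# for a real model API call; in a real agent it would send the message
# history to a provider and parse tool calls from the response.
import subprocess


def run_shell(command: str) -> str:
    """The single tool: run a shell command, return combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout + result.stderr


def call_model(messages):
    # Stubbed model: issue one tool call, then finish once it has
    # seen a tool result in the history.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "run_shell",
                              "args": {"command": "echo hello"}}}
    return {"final": "done"}


def agent_loop(task: str, max_steps: int = 10) -> str:
    """Feed the task to the model; execute tool calls until it stops."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "final" in reply:
            return reply["final"]
        call = reply["tool_call"]
        output = run_shell(**call["args"])
        messages.append({"role": "tool", "content": output})
    return "step limit reached"
```

Swapping the model behind `call_model` while keeping this loop fixed is exactly the kind of apples-to-apples comparison the commenter describes.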