← Back to context

Comment by xnorswap

12 hours ago

Perhaps it's specific because it's Opus 4.6, released February 5th.

https://www.anthropic.com/news/claude-opus-4-6

Opus 4.5 to 4.6 was pretty incremental, I didn't see much of a difference.

The big coding model moments in recent recollection, IMO, were something like:

- Sonnet 3.5 update in October 2024: ability to generate actually-working code using context from a codebase became genuinely feasible.

- Claude 4 release in May 2025: big tool calling improvements meant that agentic editors like Claude Code could operate on a noticeably longer leash without falling apart.

- Gemini 3 Pro, Claude 4.5, GPT 5.2 in Nov/Dec 2025: with some caveats these were a pretty major jump in the difficulty and scale of tasks that coding assistants are able to handle, working on much more complex projects over longer time scales without supervision, and testing their own work effectively.

  • Maybe they're like me, who didn't spend a lot of time investigating Claude until 4.6 launched and the hype was enough to be the tipping point to invest energy. I do know that I've been having good/great results with Opus 4.6 and the CLI, but after an hour or so, it'll suddenly forget that the codebase has tab-formatted files and burn up my quota trying to figure out how to read text files. And apparently this snafu has been around since at least late last year [0]. Again, I can't complain about the overall speed and quality for my relatively light projects, I'm just fascinated by people who say their agents can get through a whole weekend without supervision, when even 4.6 appears to randomly get tripped up in a very rookie way?

    [0] https://github.com/anthropics/claude-code/issues/11447

    • There's definitely a productivity curve element to getting it to behave effectively within a given codebase. Certainly in the codebases I work with most frequently I find Claude will forget certain key aspects (how to run the tests or something) after a while and need a reminder, otherwise it gets into a loop like that trying to figure out how to do it from first principles with slightly incorrect commands.

      I think a lot of the noise about letting Claude run for very extended periods involves relatively greenfield projects where the AI is going to be using tools and patterns and choices that are heavily represented in training data (unless you tell it not to), which I think are more likely to result in a codebase that lends itself to ongoing AI work. People also just exaggerate and talk about the one time doing that actually worked vs the 37 times Claude required more handholding.

      The bigger problem I see with the "leave it running for the weekend" type work is that, even if it doesn't get caught up on something trivial like tabs vs spaces (glad we're keeping that one alive in the AI era, lol), it will accumulate bad decisions about project structure/architecture/design that become really annoying to untie, and that amount to a flavor of technical debt that makes it harder for agents themselves to continue to make forward progress. Lots of insidious little things: creating giant files that eventually create context problems, duplicating important methods willy nilly and modifying them independently so their implementations drift apart, writing tests that are..."designed to pass" in a way that creates a false sense of confidence when they're passing, and "forest for the trees" kind of issues where the AI gets the logic right inside a crucial method so it looks good at a glance, but it misses some kind of bigger picture flaw in the way the rest of the code actually uses that method.