Comment by viraptor
16 hours ago
Based on quite a few comments recently, it also looks like many people have tried LLMs in the past but haven't seriously revisited either the newer or the more expensive models. And I get it. Not everyone wants to keep up to date every month, or burn cash on experiments. But at the same time, people seem to hold opinions formed in 2024. (Especially if they talk about just hallucinations and broken code - tell the agent to search for docs and fix stuff.) I'd really like to hand them Opus 4.5 as an agent to refresh their views. There's lots to complain about, but the world has moved on significantly.
This has been the argument since day one: you just have to try the latest model, that's where you went wrong. For the record, I use Claude Code quite a bit and I can't see much meaningful improvement from the last few models. It is a useful tool, but its shortcomings are very obvious.
Just last week Opus 4.5 decided that the way to fix a test was to change the code so that everything but the test broke.
When people say "fix stuff" I always wonder if it actually means fix, or just make it look like it works (which is extremely common in software, LLM or not).
Sure, I get an occasional bad result from Opus - then I revert and try again, or ask it for a fix. Even with a couple of restarts, it's going to be faster than me on average. (And that's ignoring the situations where I have to restart myself.)
Basically, you're saying it's not perfect. I don't think anyone is claiming otherwise.
The problem is that it's imperfect in very unpredictable ways, meaning you always need to keep it on a short leash for anything serious, which puts a limit on the productivity boost. And that's fine, but does it match the level of investment and expectations?
It’s not about being perfect, it’s about not being as great as the marketing, and many proponents, claim.
The issue is that there's no common definition of "fixed". "Make it run no matter what" is a more apt description in my experience, which works to a point but then becomes very painful.
What did Opus do when you told it that it shouldn't have done that?
Nice. Did it realize the mistake and correct it?
Nope, I did get a lot of fancy markdown with emojis though, so I guess that was a nice tradeoff.
In general, even with access to the entire code base (which is very small), I find the models' inherent need to satisfy the prompter to be their biggest flaw, since it constantly leads them down this path. I often have to correct overly convoluted SQL too, because my problems are simple and the training data seems to favor extremely advanced operations.
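To illustrate the kind of thing I mean (a made-up example, not my actual schema - assume an orders table with user_id and amount columns), the model reaches for CTEs and window functions where a plain aggregate would do:

    -- What the model tends to produce: a CTE plus a window function
    WITH per_user AS (
      SELECT user_id,
             SUM(amount) OVER (PARTITION BY user_id) AS total
      FROM orders
    )
    SELECT DISTINCT user_id, total
    FROM per_user;

    -- What the problem actually called for: a plain aggregate
    SELECT user_id, SUM(amount) AS total
    FROM orders
    GROUP BY user_id;

Both return the same result; the first just adds machinery the problem never asked for.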