Comment by techpression

12 hours ago

Just last week Opus 4.5 decided that the way to fix a test was to change the code so that everything else but the test broke.

When people say “fix stuff” I always wonder if it actually means fix, or just make it look like it works (which is extremely common in software, LLM or not).

Sure, I get an occasional bad result from Opus - then I revert and try again, or ask it for a fix. Even with a couple of restarts, it's going to be faster than me on average. (And that's ignoring the situations where I have to restart myself)

Basically, you're saying it's not perfect. I don't think anyone is claiming otherwise.

  • The problem is that it’s imperfect in very unpredictable ways, meaning you always need to keep it on a short leash for anything serious, which puts a limit on the productivity boost. And that’s fine, but does this match the level of investment and expectations?

  • It’s not about being perfect, it’s about not being as great as the marketing, and many proponents, claim.

    The issue is that there’s no common definition of “fixed”. “Make it run no matter what” is a more apt description in my experience, which works to a point but then becomes very painful.

Nice. Did it realize the mistake and correct it?

  • Nope, I did get a lot of fancy markdown with emojis though so I guess that was a nice tradeoff.

    In general, even with access to the entire code base (which is very small), I find the models’ inherent need to satisfy the prompter to be their biggest flaw, since it constantly leads down this path. I also often have to correct overly convoluted SQL, because my problems are simple and the training data seems to favor extremely advanced operations (rough sketch of what I mean below).
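
    To illustrate (the orders table and its columns are made up for the example): the kind of query I have in mind is a simple per-customer count, where a plain GROUP BY is all that’s needed, but the model tends to reach for CTEs and window functions:

        -- What the model tends to produce: CTE + window functions
        WITH ranked AS (
            SELECT
                customer_id,
                COUNT(*) OVER (PARTITION BY customer_id) AS order_count,
                ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at) AS rn
            FROM orders
        )
        SELECT customer_id, order_count
        FROM ranked
        WHERE rn = 1;

        -- What the problem actually called for
        SELECT customer_id, COUNT(*) AS order_count
        FROM orders
        GROUP BY customer_id;

    Both return the same result here; one is just much harder to read and review.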