Comment by josephg
2 days ago
Yes, I’m in this camp. I’ve been pushing forward some personal projects lately using LLMs. At first, I was delighted at how productive I was, prompting. But over time a lot of cracks have started to show. Claude is good at programming “in the small”. It’s good at getting self contained, well scoped tasks done. But it’s bad at large scale system thinking. Over time, every project I’ve gotten Claude to write has become riddled with poor design choices layered on top of one another until even Claude struggles to make forward progress. And at that point, what do you do? You’ve gotta read all its code. Something I’ve learned I should just do from the start to save myself a lot of time later on.
It’s also strangely bad at correctness. You can ask it to write unit tests for a project. Unless you’re careful with your prompting, it will only write the unit tests that it knows will pass.
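To make the "only writes tests it knows will pass" failure concrete, here is a minimal sketch. `apply_discount` and all three tests are hypothetical, invented for illustration and not taken from any real project. The first test recomputes the expected value with the same formula as the implementation, so it cannot fail; the other two check hand-computed values and bad input.

```python
def apply_discount(price_cents, percent):
    """Hypothetical function under test: integer cents, percent in 0..100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


def test_mirrors_implementation():
    # Weak: the expected value is derived from the same formula as the code,
    # so this passes even if the formula itself is wrong.
    assert apply_discount(999, 15) == 999 - (999 * 15) // 100


def test_against_known_values():
    # Stronger: expected values worked out by hand, including the edges.
    assert apply_discount(1000, 25) == 750
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0


def test_rejects_bad_input():
    # Out-of-range percentages should raise rather than return a number.
    for bad in (-1, 101):
        try:
            apply_discount(1000, bad)
        except ValueError:
            continue
        raise AssertionError(f"percent={bad} should have been rejected")
```

All three pass against this implementation, but only the last two would catch a regression in the formula or the validation.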
Generally, these models are amazing tools. But they cannot be trusted to make correct, maintainable software. At least not yet. Maybe in another year or two.
> Claude is good at programming “in the small”. It’s good at getting self contained, well scoped tasks done. But it’s bad at large scale system thinking. Over time, every project I’ve gotten Claude to write has become riddled with poor design choices layered on top of one another until even Claude struggles to make forward progress.
What's really fun about these tools is that this is both true and false!
If you ask these tools to reason about things in-the-large you often get very useful information back out of them.
I've asked about refactors I was thinking about doing, and gotten accurate and useful information back, which has been AMAZINGLY helpful for avoiding "let me start doing this, and then realize 4 hours in that it's not gonna work as well as I hoped" traps.
But it's a very attention-on-one-thing-at-a-time thing. IMO this is fairly inherent to the models, but people have been doing great work making the tooling around them smarter in terms of how to break up tasks ahead of time to compensate, so I'm not gonna say it won't get materially better.
So if you prompt it to do a task in a certain way, especially in a "plan mode" type of usage, you can get a pretty solid recipe + execution of a properly-designed implementation of that task.
But if you're not opinionated and checking in frequently, you're gonna get the sorta median-approach or random-luck-output-of-the-day decision. And so the human-in-the-loop point is unlikely to go away as long as the human has more context. Even if it's half-baked or not-fully-realized intuition about how the code is likely to evolve in the future that you don't put into every prompt.
> It’s also strangely bad at correctness. You can ask it to write unit tests for a project. Unless you’re careful with your prompting, it will only write the unit tests that it knows will pass.
My hunch is that this is the same fundamental problem. When its attention is fully on "produce the next string of code" the parts of the context that relate to the broader system goals are NOT being considered as much for the output. So you get things like this, even with latest Opus still, when dealing with hard-to-isolate-in-a-single-test bugs (esp when it comes to multi-service call sequences or concurrent code):
- "we need to fix this bug across eight methods in three files"
- "I found the spot! we need to do [blah blah blah]"
- "great, implement that plan"
- "I've done it!"
- "wait a sec... you moved some of the sequencing around, but didn't actually fix the fundamental issue"
- "you're right! i moved [xyz] into [func b] instead of [func a] since it needed to be called later, but actually it needs to be after [func c] since it depends on the output of func b!"
When asked about correctness it's good enough at "reasoning"-style output to spot these issues, but when generating code it's in such a pure "predict plausible code sequences" mode that this can get lost.
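One way to make that kind of sequencing requirement harder to lose is to pin it down in a test. This is a rough sketch only: `run_pipeline`, `func_a`, `func_b`, `func_c` and `step_xyz` are hypothetical stand-ins for the bracketed placeholders above, and the ordering rule (the moved step runs after `func_c` and consumes `func_b`'s output) is an assumption made for illustration.

```python
from unittest.mock import Mock, call


def run_pipeline(steps):
    """Hypothetical orchestrator for the call sequence under discussion."""
    a_out = steps.func_a()
    b_out = steps.func_b(a_out)
    steps.func_c(a_out)
    # Assumed rule: step_xyz needs func_b's output and must run after func_c.
    return steps.step_xyz(b_out)


def test_call_order_and_data_flow():
    steps = Mock()
    steps.func_a.return_value = "a-out"
    steps.func_b.return_value = "b-out"

    run_pipeline(steps)

    # The parent Mock records child calls in order as (name, args, kwargs).
    names = [recorded[0] for recorded in steps.mock_calls]
    assert names == ["func_a", "func_b", "func_c", "step_xyz"]
    # The moved step must receive func_b's output, not something recomputed.
    assert steps.step_xyz.call_args == call("b-out")
```

A test like this won't catch real concurrency bugs, but it does turn "you moved the sequencing around without fixing the issue" into a failing assertion instead of something spotted in review.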
Yeah I hear you with all of this.
It's weird, working with LLMs. There are some things the LLMs are extremely good at doing autonomously. Like, reverse engineering, or reading documentation (and using that knowledge in other areas). There are things that they can do - but you need to explicitly prompt. Like, I've found Opus is quite good at optimising code. I've had a lot of success by asking it to write benchmarks (and do profiling), and use that data to improve the performance of some piece of code. That's often enough to get quite large performance improvements. You can get even further by showing it similar code others have written which is well optimised. It's very good at copying optimisation ideas from one project to another.
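For a sense of the benchmark-first loop, here is a minimal sketch using only the standard library. The two `sum_squares` functions and the `bench` helper are hypothetical examples, not code from any project mentioned here. The point is the shape: check that the candidate agrees with the baseline, then time both.

```python
import timeit


def slow_sum_squares(n):
    """Baseline: sum of i*i for i in range(n), computed with a loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum_squares(n):
    """Candidate: closed form for the same sum, (n-1)*n*(2n-1)/6."""
    return (n - 1) * n * (2 * n - 1) // 6


def bench(fn, *args, repeat=5, number=100):
    """Best-of-`repeat` average seconds per call."""
    timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    n = 10_000
    # Correctness first: a faster function that disagrees is not an optimisation.
    assert slow_sum_squares(n) == fast_sum_squares(n)
    print(f"slow: {bench(slow_sum_squares, n):.3e} s/call")
    print(f"fast: {bench(fast_sum_squares, n):.3e} s/call")
```

Handing the model numbers like these (plus `cProfile` output for anything non-trivial) gives it something measurable to improve against.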
But then there are very simple things it really struggles to do. Some kinds of correctness testing. Invariants. System design.
Is it bad at that stuff, or do I just need to figure out how to prompt the LLM? To return to the topic of this thread, I think we're seeing a lot of different opinions on LLM generated code for 3 reasons:
1. Some people aren't looking at Claude's output at all. Some people are looking at the code and it looks fine to them. And some people (with more experience writing software) are looking at the code and judging it poorly.
2. We all prompt our LLMs very differently! It turns out that you get really different results based on how you prompt the machine. We're all figuring this out together. Some people have better instincts than others.
3. We're working on different projects. Claude is comparatively much better at end-user facing software. It's great at making a standalone website. It's much less good at finding and fixing obscure bugs in large, established pieces of software. If you work in consulting, LLMs can already do a lot of your job. If you work on Chrome or Unreal or the Windows kernel, maybe not so much.
> Is it bad at that stuff, or do I just need to figure out how to prompt the LLM? To return to the topic of this thread, I think we're seeing a lot of different opinions on LLM generated code for 3 reasons:
I think there's definitely a bit of both. Some things are easier to prompt "adequately." Some domains or types of requests are tougher.
The billion dollar question (well, trillion dollar, looking at the valuations of OpenAI and Anthropic) is will that change enough to actually replace the highly-paid people who currently are needed to make sure shit doesn't go sideways? They're betting that they can solve the "turn a bad prompt into a good-enough series of prompts" problem generically for everyone.
And where exactly that lands could have a lot of knock-on effects. The easy targets are things like SaaS products that are only valuable because of economies of scale, where the underlying problems are "simple" if you don't need that scale.
But even there, there are a lot of echoes of the past, where things like ad-hoc Access apps or spreadsheets powered (or still power!) a bunch of business processes in lieu of SaaS products. How much appetite is there long-term for large businesses to really go back to owning all that in-house?
There's also a fun irony in that the trillion-dollar-valuation world is basically "the biggest SaaS of them all": those companies would have a huge target on them, and at that point practically everyone else in the world would be gunning for them. If they do find that "good enough" point, they also have to hope that nobody can replicate it for less anytime soon... (but they'll also have given the folks aiming to do just that a great tool for helping build those systems).