Comment by josephg

2 days ago

Yeah I hear you with all of this.

It's weird, working with LLMs. There are some things the LLMs are extremely good at doing autonomously. Like reverse engineering, or reading documentation (and using that knowledge in other areas). There are things that they can do, but only if you explicitly prompt for them. Like, I've found Opus is quite good at optimising code. I've had a lot of success by asking it to write benchmarks (and do profiling), and use that data to improve the performance of some piece of code. That's often enough to get quite large performance improvements. You can get even further by showing it similar code others have written which is well optimised. It's very good at copying optimisation ideas from one project to another.

But then there are very simple things it really struggles to do. Some kinds of correctness testing. Invariants. System design.

Is it bad at that stuff, or do I just need to figure out how to prompt the LLM? To return to the topic of this thread, I think we're seeing a lot of different opinions on LLM generated code for 3 reasons:

1. Some people aren't looking at Claude's output at all. Some people are looking at the code and it looks fine to them. And some people (with more experience writing software) are looking at the code and judging it poorly.

2. We all prompt our LLMs very differently! It turns out that you get really different results based on how you prompt the machine. We're all figuring this out together. Some people have better instincts than others.

3. We're working on different projects. Claude is comparatively much better at end-user facing software. It's great at making a standalone website. It's much less good at finding and fixing obscure bugs in large, established pieces of software. If you work in consulting, LLMs can already do a lot of your job. If you work on Chrome or Unreal or the Windows kernel, maybe not so much.

> Is it bad at that stuff, or do I just need to figure out how to prompt the LLM? To return to the topic of this thread, I think we're seeing a lot of different opinions on LLM generated code for 3 reasons:

I think there's definitely a bit of both. Some things are easier to prompt "adequately." Some domains or types of requests are tougher.

The billion-dollar question (well, trillion-dollar, looking at the valuations of OpenAI and Anthropic) is whether that will change enough to actually replace the highly-paid people who are currently needed to make sure shit doesn't go sideways. They're betting that they can solve the "turn a bad prompt into a good-enough series of prompts" problem generically for everyone.

And where exactly that lands could have a lot of knock-on effects. The easy targets are things like SaaS products that are only valuable because of economies of scale, where the underlying problems are "simple" if you don't need that scale.

But even there, there are a lot of echoes of the past, where things like ad-hoc Access apps or spreadsheets powered (or still power!) a bunch of business processes in lieu of SaaS products. How much appetite is there long-term for large businesses to really go back to owning all that in-house?

There's also a fun irony in that the trillion-dollar-valuation world is basically "the biggest SaaS of them all". Those companies have a huge target on them, and at that point practically everyone else in the world would be gunning for them. If they do find that "good enough" point, they also have to hope that nobody can replicate it for less anytime soon... (but they'll also have given the folks aiming to do just that a great tool for helping build those systems).