Comment by juujian
1 year ago
I often run into LLMs writing "beginner code" that uses the most fundamental functions in really impractical ways. Trained on too many tutorials, I assume.
Usually, specifying the packages to use and asking for something less convoluted works really well. The problem is, how would you know if you have never learned to code without an LLM?
>I often run into LLMs writing "beginner code" that uses the most fundamental functions in really impractical ways. Trained on too many tutorials, I assume.
In the absence of any other context, that's probably a sensible default behaviour. If someone is just asking "write me some code that does x", they're highly likely to be a beginner and they aren't going to be able to understand or reason about a more sophisticated approach. IME LLMs will very readily move away from that default if you provide even the smallest amount of context; in the case of this article, even by doing literally the dumbest thing that could plausibly work.
I don't mean to cast aspersions, but a lot of criticisms of LLMs are really criticising them for not being psychic. LLMs can only respond to the prompt they're given. If you want highly optimised code but didn't ask for it, how is the LLM supposed to know that's what you wanted?
In my experience the trouble with LLMs at the professional level is that they're almost as much work to prompt to get the right output as it would be to simply write the code. You have to provide context, ask nicely, come up with and remind it about edge cases, suggest which libraries to use, proofread the output, and correct it when it inevitably screws up anyway.
I use Copilot for autocomplete regularly, and that's still the peak LLM UX for me. I prompt it by just writing code, it automatically pulls into context the file I'm working on and imported files, it doesn't insist on writing an essay explaining itself, and it doesn't get overly ambitious. And in addition to being so much easier to work with, I find it still produces better code than anything I get out of the chat models.
After 6 months of co-pilot autocomplete in my text editor feeling like an uninformed back seat driver with access to the wheel, I turned it off yesterday.
It’s night and day compared to what I get from Claude Sonnet 3.5 in their UI, and even then only for mainstream languages.
2 replies →
> In my experience the trouble with LLMs at the professional level is that they're almost as much work to prompt to get the right output as it would be to simply write the code.
Yeah. It's often said that reading (and understanding) code is often harder than writing new code, but with LLMs you always have to read code written by someone else (something else).
There is also the adage that you should never write the most clever code you can, because understanding it later might prove too hard. So it's probably for the best that LLM code often isn't too clever, or else novices unable to write the solution from scratch will also be unable to understand it and assess whether it actually works.
1 reply →
It depends on what you’re doing. I’ve been using Claude to help me write a web admin interface to some backend code I wrote. I haven’t used react since it first came out (and I got a patch randomly in!)… it completely wrote a working react app. Yes it sometimes did the wrong thing, but I just kept correcting it. I was able in a few hours to do something that would have taken me weeks to learn and figure out. I probably missed out on learning react once again, but the time saved on a side project was immense! And it came up with some pretty ok UI I also didn’t have to design!
Even as someone with plenty of experience, this can still be a problem: I use them for stuff outside my domain, but where I can still debug the results. In my case, this means I use it for python and web frontend, where my professional experience has been iOS since 2010.
ChatGPT has, for several generations, generally made stuff that works, but the libraries it gives me are often not the most appropriate, and are sometimes obsolete or no longer functional — and precisely because web and python are hobbies for me rather than my day job, it can take me a while to spot such mistakes.
Two other things I've noticed, related in an unfortunate way:
1) Because web and python are not my day job, more often than not and with increasing frequency, I ultimately discover that when I disagree with ChatGPT, the AI was right and I was wrong.
2) These specific models often struggle when my response has been "don't use $thing or $approach"; unfortunately this seems to be equally applicable regardless of if the AI knew more than me or not, so it's not got predictive power for me.
(I also use custom instructions, so YMMV)
I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.
Instead, think of your queries as super human friendly SQL.
The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
So how much code is on the web for a particular problem? 10k blog entries and Stack Overflow responses? What you get back is a mishmash of these.
So it will have decade old libraries, as lots of those scraped responses are 10 years old, and often without people saying so.
And it will likely have more poor code examples than not.
I'm willing to bet that OpenAI's ingestion of Stack Overflow responses gave higher priority to accepted answers, but that still leaves a lot of margin.
And how you write your query may sideline you into responses with low-quality output.
I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
And I've seen some pretty poor code examples out there.
> Instead, think of your queries as super human friendly SQL.
> The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
This is a useful model for LLMs in many cases, but it's also important to remember that it's not a database with perfect recall. Not only is it a database with a bunch of bad code stored in it, it samples randomly from that database on a token by token basis, which can lead to surprises both good and bad.
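To make the "samples token by token" part concrete, here's a rough sketch in plain numpy (the vocabulary and scores are invented toy values, not anything from a real model) showing how the same stored probabilities can still give different answers on different runs:

    import numpy as np

    # Toy illustration only: a hypothetical next-token distribution for the
    # prefix "import ". The vocabulary and scores are made up, not taken
    # from any actual model.
    vocab = ["numpy", "os", "requests", "antigravity"]
    logits = np.array([2.0, 1.5, 1.0, -3.0])

    def sample_next_token(logits, temperature=1.0, rng=None):
        """Sample one token from a softmax over the logits.

        Lower temperature concentrates probability on the top token;
        higher temperature flattens the distribution and invites surprises.
        """
        rng = rng or np.random.default_rng()
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return vocab[rng.choice(len(vocab), p=probs)]

    # Same "database" of scores, potentially different output each run.
    print([sample_next_token(logits, temperature=0.8) for _ in range(5)])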
> There is no thinking. No comprehension. No decisions.
Re-reading my own comment, I am unclear why you think it necessary to say those specific examples — my descriptions were "results, made, disagree, right/wrong, struggle": tools make things, have results; engines struggle; search engines can be right or wrong; words can be disagreed with regardless of authorship.
While I am curious what it would mean for a system to "think" or "comprehend", every time I have looked at such discussions I have been disappointed that it's pre-paradigmatic. The closest we have is examples such as Turing 1950[0] saying essentially (to paraphrase) "if it quacks like a duck, it's a duck" vs. Searle 1980[1] which says, to quote the abstract itself, "no program by itself is sufficient for thinking".
> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
All of maths can be derived from the axioms of maths. All chess moves derive from the rules of the game. This kind of process has a lot of legs, regardless of if you want to think of the models as "thinking" or not.
Me? I don't worry too much about whether they can actually think, not because there are no important philosophical questions about what that even means, but because other things have a more immediate impact: even if they are "just" a better search engine, they're a mechanism that somehow managed to squeeze almost all of the important technical info on the internet into something that fits into RAM on a top-end laptop.
The models may indeed be cargo-cult golems — I'd assume that by default, there's so much we don't yet know — but whatever is or isn't going on inside, they still do a good job of quacking like a duck.
[0] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460. https://doi.org/10.1093/mind/LIX.236.433
[1] Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3(3), 417–424. https://doi.org/10.1017/S0140525X00005756
2 replies →
> Instead, think of your queries as super human friendly SQL.
I feel that comparison oversells things quite a lot.
The user is setting up a text document which resembles a question-and-response exchange, and executing a make-any-document-bigger algorithm.
So it's less querying for data and more like shaping a sleeping dream of two fictional characters in conversation, in the hopes that the dream will depict one character saying something superficially similar to mostly-vanished data.
2 replies →
> think of your queries as super human friendly SQL
> The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
I disagree that this is the accurate way to think about LLMs. LLMs still use a finite number of parameters to encode the training data. The amount of training data is massive in comparison to the number of parameters LLMs use, so they need to be somewhat capable of distilling that information into small pieces of knowledge they can then reuse to piece together the full answer.
But this being said, they are not capable of producing an answer outside of the training set distribution, and inherit all the biases of the training data as that's what they are trying to replicate.
> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said. And I've seen some pretty poor code examples out there.
Yup, exactly this.
> I wish people would understand what a large language model is.
I think your view of LLMs does not explain the learning of algorithms that these constructs are clearly capable of; see for example: https://arxiv.org/abs/2208.01066
More generally, the best way to compress information from too many different coding examples is to figure out how to code rather than try to interpolate between existing blogs and QA forums.
My own speculation is that with additional effort during training (RL or active learning in the training loop) we will probably reach superhuman coding performance within two years. I think that o3 is still imperfect but not very far from that point.
21 replies →
Every model for how to approach an LLM seems lacking to me. I would suggest that anyone using AI heavily take a weekend and build a simple neural network that does handwritten digit recognition. Once you get a feel for a basic neural network, watch a good introduction to AlexNet. Then you can think of an LLM as the next step in that sequence.
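For what it's worth, a rough sketch of that weekend exercise might look something like the following: scikit-learn's small 8x8 digits set standing in for MNIST, plain numpy, no biases or batching, just enough moving parts (forward pass, gradient, weight update) to build the intuition. The layer sizes and hyperparameters are arbitrary choices, not a recipe.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    # Tiny two-layer network on scikit-learn's 8x8 digits (a small MNIST stand-in).
    digits = load_digits()
    X = digits.data / 16.0                       # scale pixel values to [0, 1]
    y = np.eye(10)[digits.target]                # one-hot labels
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    rng = np.random.default_rng(0)
    W1 = rng.normal(0.0, 0.1, (64, 32))          # input (8x8 = 64 pixels) to hidden
    W2 = rng.normal(0.0, 0.1, (32, 10))          # hidden to 10 digit classes

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    lr = 0.5
    for epoch in range(200):                     # full-batch gradient descent
        h = np.tanh(X_tr @ W1)                   # forward pass
        p = softmax(h @ W2)
        grad_out = (p - y_tr) / len(X_tr)        # cross-entropy gradient at the output
        grad_W2 = h.T @ grad_out                 # backprop through the second layer
        grad_h = (grad_out @ W2.T) * (1 - h**2)  # ...and through the tanh
        grad_W1 = X_tr.T @ grad_h
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    pred = softmax(np.tanh(X_te @ W1) @ W2).argmax(axis=1)
    print("test accuracy:", (pred == y_te.argmax(axis=1)).mean())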
>I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
This isn't correct. It embeds concepts that humans have discussed, but it can combine them in ways that were never in the training set. There are issues with this: the more unusual the combination of concepts, the more likely the output ends up being unrelated to what the user wanted to see.
> I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.
> Instead, think of your queries as super human friendly SQL.
Ehh this might be true in some abstract mathy sense (like I don't know, you are searching in latent space or something), but it's not the best analogy in practice. LLMs process language and simulate logical reasoning (albeit imperfectly). LLMs are like language calculators, like a TI-86 but for English/Python/etc, and sufficiently powerful language skills will also give some reasoning skills for free. (It can also recall data from the training set so this is where the SQL analogy shines I guess)
You could say that SQL also simulates reasoning (it is equivalent to Datalog after all) but LLMs can reason about stuff more powerful than first order logic. (LLMs are also fatally flawed in the sense it can't guarantee correct results, unlike SQL or Datalog or Prolog, but just like us humans)
Also, LLMs can certainly make decisions, such as the decision to search the web. But this isn't very interesting - a thermostat makes the decision of whether turn air refrigeration on or off, for example, and an operating system makes the decision of which program to schedule next on the CPU.
1 reply →
I actually find it super refreshing that they write "beginner" or "tutorial code".
Maybe because of experience: it's much simpler and easier to turn that into "senior code". After a few decades of experience I appreciate simplicity over the over-engineered mess that some mid-level developers tend to produce.
True. It's not elitist. There are some limits though to sensible use of built-in functions. Stops being comprehensible fast.
yeah I’m interested in asking it to “write more human readable code” over and over next, “more readable!”
I used to really like Claude for code tasks but lately it has been a frustrating experience. I use it for writing UI components because I just don’t enjoy FE even though I have a lot of experience on it from back in the day.
I tell it up front that I am using react-ts and mui.
80% of the time it will use tailwind classes which makes zero sense. It won’t use the sx prop and mui system.
It is also outdated it seems. It keeps using deprecated props and components which sucks and adds more manual effort on my end to fix. I like the quality of Claude’s UX output, it’s just a shame that it seems so bad on actual coding tasks.
I stopped using it for any backend work because it is so outdated, or maybe it just doesn’t have the right training data.
On the other hand, I give ChatGPT a link to the docs and it gives me the right code 90% or more of the time. Only shame is that its UX output is awful compared to Claude. I am also able to trust it for backend tasks, even if it is verbose AF with the explanations (it wants to teach me even if I tell it to return code only).
Either way, using these tools in conjunction saves me at least 30 min to an hour daily on tasks that I dislike.
I can crank out code better than AI, and I actually know and understand systems design and architecture well enough to build a codebase that scales both technically and organizationally: easy to modify, extend, and test, with single responsibilities.
AI just slams everything into a single class or uses weird utility functions that make no sense on the regular. Still, it’s a useful tool in the right use cases.
Just my 2 cents.
I've stopped using LLMs to write code entirely. Instead, I use Claude and Qwen as "brilliant idiots" for rubber ducking. I never copy and paste code it gives me, I use it to brainstorm and get me unstuck.
I'm more comfortable using it this way.
Having spent nearly 12 hours a day for a year with GPTs, I agree that this is the way. Treat it like a professor at office hours who's sometimes a little apathetically wrong because they're overworked and underfunded.
People should try to switch to a more code-focused interface, like aider.
Copy and pasting code it gives you just means your workflow is totally borked, and it's no wonder you wouldn't want to try to let it generate code, because it's such a pain in your ass to try it, diff it, etc.
9 replies →
To each their own, and everyone's experience seems to vary, but I have a hard time picturing people using the Claude/ChatGPT web UIs for any serious development. It seems like so much time would be wasted recreating good context, copy/pasting, etc.
We have tools like Aider (which has a copy/paste mode if you don't have API access for some reason), Cline, Copilot edit mode, and more. Things like having a conventions file, exposing the dependencies list, and easy addition of files into context seem essential to me in order to make LLMs productive, and I always spend more time steering results when easy, consistent context isn't at my fingertips.
Before the advent of proper IDE integrations and editors like Zed, copy-pasting from the web UI was basically how things were done, and man was it daunting. As you say, having good, fine-grained, repeatable, and well-integrated context management is paramount to efficient LLM-based work.
1 reply →
Both these issues can be resolved by adding some sample code to context to influence the LLM to do the desired thing.
As the op says, LLMs are going to be biased towards doing the "average" thing based on their training data. There's more old backend code on the internet than new backend code, and Tailwind is pretty dominant for frontend styling these days, so that's where the average lands.
>Problem is, how would you know if you have never learned to code without an LLM?
The quick fix I use when needing to do something new is to ask the AI to list different libraries and the pros and cons of using them. Then I quickly hop on Google and check which have good documentation and examples so I know I have something to fall back on, and from there I ask the AI how to solve a small, simple version of my problem and explain what the library is doing. Only then do I ask it for a solution and see if it is reasonable or not.
It isn't perfect, but it saves enough time most of the time to more than make up for when it fails and I have to go back to old-fashioned RTFMing.
Other imperfect things you can add to a prompt: things that might lead it away from tutorial-style code.
It depends on the language too. Obviously there's way more "beginner code" out there in Python and Javascript than most other languages.
The next hurdle is the lack of time sensitivity regarding standards and versions. Even if you prompt with the exact framework version, it still comes up with deprecated or obsolete methods. Initially that may be appealing to someone who knows nothing about the framework, but an LLM won't grow anyone to expert level in rapidly changing tech.
LLMs are trained on content from places like Stack Overflow, Reddit, and GitHub code, and they generate tokens calculated as a sort of statistically likely aggregate of mediocre code. Of course the result is going to be uninspired and impractical. Writing good code takes more than copy-pasting the same thing everyone else is doing.
I've just been using them for completion. I start writing, and give it a snippet + "finish refactoring this so that xyz."
That and unit tests. I write the first table based test case, then give it the source and the test code, and ask it to fill it in with more test cases.
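For example, the first table-based case might look like the following sketch (pytest's parametrize here; the slugify function and its module are hypothetical placeholders for whatever is actually under test):

    import pytest

    from myproject.text import slugify  # hypothetical module/function under test

    CASES = [
        # (input, expected): write the first row by hand, then hand the source
        # plus this table to the model and ask it to fill in more rows.
        ("Hello, World!", "hello-world"),
    ]

    @pytest.mark.parametrize("raw, expected", CASES)
    def test_slugify(raw, expected):
        assert slugify(raw) == expected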
I suspect it's not going to be much of a problem. Generated code has been getting rapidly better. We can reassess what to worry about once that improvement slows or stops, but I suspect unoptimized code will not be of much concern.
Totally agree, seen it too. Do you think it can be fixed over time with better training data and optimization? Or, is this a fundamental limitation that LLMs will never overcome?
This ^