Comment by diggan
10 months ago
> And for me, past a certain point, even if you continually report back problems it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.
It sucks, but the trick is to always restart the conversation/chat with a new message. I never go beyond one reply, and I also copy-paste a bunch. Got tired of copy-pasting, so I wrote something like a prompting manager (https://github.com/victorb/prompta) to make it easier and to avoid having to neatly format code blocks and so on.
Basically: make one message, and if it gets the reply wrong, iterate on the prompt itself and start fresh, always. Don't try to correct it by adding another message; update the initial prompt to make it clearer and steer it more.
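In code terms the workflow looks roughly like this (a minimal sketch using the OpenAI Python client rather than prompta itself; the model name and prompts are just placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def one_shot(prompt: str, model: str = "gpt-4o") -> str:
        """Send a single fresh message, with no accumulated chat history."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # First attempt: one message, one reply.
    prompt = "Write a Python function that parses ISO-8601 timestamps."
    print(one_shot(prompt))

    # Reply was wrong? Don't send a correction as message #2.
    # Fold the correction into a new, clearer prompt and start over:
    prompt = (
        "Write a Python function that parses ISO-8601 timestamps. "
        "Use only the standard library (datetime.fromisoformat), no third-party packages."
    )
    print(one_shot(prompt))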
But I've noticed that every model degrades really quickly past the initial reply, no matter the length of each individual message. The companies keep increasing the theoretical and practical context limits, but quality degrades well before those limits are reached, and they don't seem to be addressing that (nor do they have a way of measuring it).
This is my experience as well, as has been for over a year now.
LLMs are so incredibly transformative when they're incredibly transformative. And when they aren't, it's much better to fall back on the years of hard-won experience I have - the sooner the better. For example, I'll switch between projects and languages, and even with explicit instruction to move to a strongly typed language they'll stick to answers for the dynamic one. It's an odd experience to re-find my skills every once in a while. "Oh yeah, I'm pretty good at reading docs myself".
With all the incredible leaps in LLMs being reported (especially here on HN) I really haven't seen much of a difference in quite a while.
Interesting. This is another problem aider does not experience. It works on a git repo. If you switch repos, it changes context.
I’m not affiliated with aider. I just use it.
I have a bet that many of the pitfalls people experience at the moment are due to mismatched tools or immature tools.
In other words, don't lean on the context window. Treat it as a command line with input/output, where the purpose of each command is to extract an information signal, manipulate some knowledge, mine some data, and so on.
Also, special care has to be given to the number of tokens. Even with one question / one answer, only about 500 to 1,000 tokens can be focused on at once by our artificial overlords. After that they start losing their marbles. There are exceptions to that rule with the reasoning models, but in essence they are not that different.
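A quick way to keep an eye on that budget (a sketch using tiktoken; the 1,000-token cutoff is just the rough figure above, not a hard limit, and prompt.txt is a placeholder file):

    import tiktoken

    TOKEN_BUDGET = 1_000  # rough "stays focused" budget from the comment above

    def count_tokens(text: str, encoding: str = "cl100k_base") -> int:
        """Count tokens with an OpenAI-style tokenizer."""
        enc = tiktoken.get_encoding(encoding)
        return len(enc.encode(text))

    with open("prompt.txt") as f:
        prompt = f.read()

    n = count_tokens(prompt)
    if n > TOKEN_BUDGET:
        print(f"Prompt is {n} tokens -- consider trimming it before sending.")
    else:
        print(f"Prompt is {n} tokens -- within budget.")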
The difference between using the tool correctly and not might be getting 99.9% accuracy instead of just 98%. That probably doesn't sound like a big difference to some people, but in terms of error rate it's 0.1% versus 2%: the first case produces roughly 20 times fewer mistakes.
People keep throwing these 95%+ accuracy rates for LLMs into these discussions, but that is nonsense. It's closer to 70%, which is quite terrible. I use LLMs, but I never trust them beyond doing some initial search when I'm stumped, and as soon as they unblock me I put them down again. They're not transformative; they're merely replacing Google, because search there has sucked for a while.
95% accuracy vs 70% accuracy: both numbers are pulled out of someone's ass and serve little purpose in the discussion at hand. How did you measure that? Or rather, since you didn't, what's the point of sharing this hypothetical 25-point difference?
And how funny that your comment seems to land right alongside this one about people having very different experiences with LLMs:
> I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.
https://news.ycombinator.com/item?id=43898532
This has largely been my experience as well, at least with GH Copilot. I mostly use it as a better Google now, because even with the context of my existing code, it can't adhere to my style at all. Hell, it can't even get Docker Compose files right: it mixes up schema versions and uses incorrect parameters all the time.
I also noticed that the language matters a lot. It's pretty good with Python, pandas, matplotlib, etc. But ask it to write some PowerShell and it regularly hallucinates modules that don't exist, more than in any other language I've tried it with.
And good luck if you're working with a stack that's not the flavor of the month with plenty of information available online. ERP systems with documentation living behind a paywall, so it's not in the training data - you know, the real-world enterprise CRUD use cases where I'd want to use it the most - are where it's the least helpful.
I think “don’t use the context window” might be too simple. It can be incredibly useful. But avoid getting in a context window loop. When iterations stop showing useful progress toward the goal, it’s time to abandon the context. LLMs tend to circle back to the same dead end solution path at some point. It also helps to jump between LLMs to get a feel for how they perform on different problem spaces.
Depends on the use case. For programming where every small detail might have huge implications, 98% accuracy vs 99.99% is a ginormous difference.
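One back-of-the-envelope way to see why that gap is so big (a sketch that assumes errors compound independently across many small steps; the numbers are illustrative, not measured):

    # If a change involves many small details that each have to be right,
    # per-detail accuracy compounds multiplicatively.
    steps = 200

    for per_step_accuracy in (0.9999, 0.98):
        p_all_correct = per_step_accuracy ** steps
        print(f"{per_step_accuracy:.2%} per step -> "
              f"{p_all_correct:.1%} chance that all {steps} steps are right")

    # Roughly: 99.99% per step -> ~98% of runs fully correct,
    #          98.00% per step -> ~2% of runs fully correct.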
Other tasks can be more forgiving, like writing, which I do all the time; there I load 3,000 tokens into the context window pretty frequently. Small details in the accuracy don't matter so much for everyday casual tasks like rewriting text, summarizing, etc.
In general, be wary of how much context you load into the chat; performance degrades faster than you can imagine. OK, the aphorism I started with was a little simplistic.
Aider, the tool, does exactly the opposite, in my experience.
It really works, for me. It iterates by itself and fixes the problem.
I'm in the middle of test-driving Aider and I'm seeing exactly the same problem: the longer a conversation goes on, the worse the quality of the replies... Currently, I'm doing something like this to prevent it from loading previous context:
Which clears the history, then I basically re-run that whenever the model gets something wrong (so I can update/improve the prompt and try again).
Why not use the /clear and /reset commands?