
Comment by darepublic

10 months ago

So in my interactions with GPT, o3, and o4-mini, I am the organic middleman who copies and pastes code into the REPL and reports the output back to GPT if anything turns out to be a problem. And for me, past a certain point, even if you continually report problems, its new suggestions don't get any better. It just spins its wheels. So for that reason I'm a little skeptical about the value of automating this process. Maybe the LLMs you are using are better than the ones I tried this with?

Specifically, I was researching a lesser-known Kafka-MQTT connector: https://docs.lenses.io/latest/connectors/kafka-connectors/si..., and o1 was hallucinating the configuration needed to support dynamic topics. The docs said one thing, and I even mentioned to o1 that the docs contradicted it, but it would stick to its guns. If I mentioned that the code wouldn't compile, it would start suggesting very implausible scenarios -- did you spell this correctly? Responses like that indicate you've reached a dead end. I'm curious how/if the "structured LLM interactions" you mention overcome this.

> And for me, past a certain point, even if you continually report back problems it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.

It sucks, but the trick is to always restart the conversation/chat with a new message. I never go beyond one reply, and I also copy-paste a bunch. I got tired of copy-pasting, so I wrote something like a prompting manager (https://github.com/victorb/prompta) to make it easier and to avoid having to neatly format code blocks and so on.

Basically, make one message; if the reply is wrong, iterate on the prompt itself and start fresh, always. Don't try to correct it by adding another message -- update the initial prompt to make it clearer and steer it more.

But I've noticed that every model degrades really quickly past the initial reply, no matter the length of each individual message. The companies keep increasing the theoretical and practical context limits, but quality degrades much faster, well within those limits, and they don't seem to be trying to address that (nor do they have a way of measuring it).
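The "restart fresh, revise the prompt" workflow above can be sketched as a small loop. The `ask`/`accept`/`revise` names are mine, not any particular client library's API:

```python
# Sketch of the "restart fresh" strategy: ask(messages) calls a model,
# accept(reply) checks the result, and revise(prompt, reply) rewrites
# the *initial* prompt for the next attempt. All names are stand-ins.

def one_shot_loop(prompt, ask, accept, revise, max_tries=5):
    for _ in range(max_tries):
        # Every attempt is a fresh conversation with exactly one
        # message -- the chat history never grows.
        messages = [{"role": "user", "content": prompt}]
        reply = ask(messages)
        if accept(reply):
            return reply
        # Instead of appending a correction message, steer by
        # editing the prompt itself and starting over.
        prompt = revise(prompt, reply)
    return None
```

The point is structural: the model only ever sees the (revised) initial prompt, never a long trail of corrections.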

  • This is my experience as well, as has been for over a year now.

    LLMs are so incredibly transformative when they're incredibly transformative. And when they aren't, it's much better to fall back on my years of hard-won experience -- the sooner the better. For example, I'll switch between projects and languages, and even with explicit instructions to move to a strongly typed language, they'll stick to dynamic answers. It's an odd experience to re-find my skills every once in a while: "Oh yeah, I'm pretty good at reading docs myself."

    With all the incredible leaps in LLMs being reported (especially here on HN) I really haven't seen much of a difference in quite a while.

    • Interesting. This is another problem aider does not experience. It works on a git repo. If you switch repos, it changes context.

      I’m not affiliated with aider. I just use it.

      I have a bet that many of the pitfalls people experience at the moment are due to mismatched tools or immature tools.

  • In other words, don't use the context window. Treat it as a command line with input/output, in which the purpose of each command is information extraction, knowledge manipulation, data mining, and so on.

    Also, special care has to be given to the number of tokens. Even with one question / one answer, only around 500 to 1,000 tokens can be focused on at once by our artificial overlords. After that they start losing their marbles. Reasoning models are exceptions to that rule, but in essence they are not that different.

    The difference between using the tool correctly and not might be that instead of getting 99.9% accuracy, the user gets just 98%. That probably doesn't sound like a big difference to some people, but it means the tool works 10 times better in the first case.
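    One way to make that concrete (illustrative numbers, mine not the parent's): per-step accuracy compounds over a chain of dependent steps, so a small per-step gap becomes a huge end-to-end gap.

    ```python
    # Illustrative only: if every step in an n-step chain must be
    # right, overall success probability is p ** n.
    def chain_success(p, n):
        return p ** n

    # Over 100 chained steps, 99.9% per step keeps ~90% overall,
    # while 98% per step collapses to ~13%.
    ```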

    • People keep throwing around 95%+ accuracy rates for LLMs in these discussions, but that is nonsense. It's closer to 70%. It's quite terrible. I use LLMs, but I never trust them beyond an initial search when I'm stumped, and once they unblock me I immediately put them down again. They're not transformative; they're merely replacing Google, because search there has sucked for a while.


    • I think “don’t use the context window” might be too simple. It can be incredibly useful. But avoid getting in a context window loop. When iterations stop showing useful progress toward the goal, it’s time to abandon the context. LLMs tend to circle back to the same dead end solution path at some point. It also helps to jump between LLMs to get a feel for how they perform on different problem spaces.


  • Aider, the tool, does exactly the opposite, in my experience.

    It really works, for me. It iterates by itself and fixes the problem.

    • I'm in the middle of test-driving Aider and I'm seeing exactly the same problem: the longer a conversation goes on, the worse the quality of the replies... Currently, I'm doing something like this to prevent it from loading previous context:

          rm -r .aider.chat.history.md .aider.tags.cache.v4 || true && aider --architect --model deepseek/deepseek-reasoner --editor-model deepseek/deepseek-chat --no-restore-chat-history
      

      Which clears the history, then I basically re-run that whenever the model gets something wrong (so I can update/improve the prompt and try again).


I refuse to stop being the middleman, because I can often catch a really bad implementation early and course-correct. E.g. a function that solves a problem with a series of nested loops when it could be done several orders of magnitude faster using the vectorised operations offered by common packages like numpy.
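As a toy example of the kind of thing I mean (my own illustration, not code from any model): pairwise distances written as nested loops versus a single broadcast expression.

```python
import numpy as np

def pairwise_dists_loops(a, b):
    # The nested-loop shape an LLM often produces first:
    # O(n*m) Python-level iterations.
    out = np.zeros((len(a), len(b)))
    for i in range(len(a)):
        for j in range(len(b)):
            out[i, j] = np.sqrt(np.sum((a[i] - b[j]) ** 2))
    return out

def pairwise_dists_vectorised(a, b):
    # Same result via broadcasting: the loops move into optimised C,
    # typically orders of magnitude faster on large inputs.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Both return the same matrix; only the second one scales.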

Even with all the coding-agent magik people harp on about, I've never seen something that can write clean, good-quality code reliably. I'd prefer to tell an LLM what a function's purpose is, what kind of information and data structures it can expect, and what it should output; see what it produces; provide feedback; and get a rather workable, often perfect, function in return.

If I get it to write the whole thing in one go, I cannot imagine the pain of having to find out where the fuckery is that slows everything down without diving deep with profilers etc. -- all for a problem I could have solved by just playing middleman, keeping a close eye on how things are building up, and being in charge of ensuring the overarching vision is achieved as required.

> If I mentioned that the code wouldn't compile it would start suggesting very implausible scenarios

I have to chuckle at that because it reminds me of a typical response on technical forums long before LLMs were invented.

Maybe the LLM has actually learned from those responses and is imitating them.

  • It seems no discussion of LLMs on HN these days is complete without a commenter wryly observing how that one specific issue someone is pointing to with an LLM is also, funnily enough, an issue they've seen with humans. The implication always seems to be that this somehow bolsters the idea that LLMs are therefore in some sense and to some degree human-like.

    Humans not being infallible superintelligences does not mean that the thing that LLMs are doing is the same thing we do when we think, create, reason, etc. I would like to imagine that most serious people who use LLMs know this, but sometimes it's hard to be sure.

    Is there a name for the "humans stupid --> LLMs smart" fallacy?

    • > Is there a name for the "humans stupid --> LLMs smart" fallacy?

      No one is saying "humans stupid --> LLMs smart". That's absolutely not what the commenter above you said. Your whole comment is a strawman fallacy.

    • > The implication always seems to be that this somehow bolsters the idea that LLMs are therefore in some sense and to some degree human-like.

      Nah, it's something else: it's that LLMs are being held to a higher standard than humans. Humans are fallible, and that's okay. The work they do is still useful. LLMs do not have to be perfect either to be useful.

      The question of how good they are absolutely matters. But some error isn't immediately disqualifying.


    • Well, they try to copy humans, and humans on the internet are very different creatures from humans in face-to-face interaction. So I see the angle.

      It is sad that, inadvertently or not, LLMs may have picked up on the traits of the loudest humans: abrasive, never admitting fault, always bringing up something that sounds plausible but falls apart under scrutiny. The only thing they hold back on is resorting to insults when cornered.

    • > the idea that LLMs are therefore in some sense and to some degree human-like.

      This is 100% true, isn't it? It is based on the corpus of humankind's knowledge and interaction, so it is only to be expected that it would "repeat" human patterns. It also makes sense that the way to evolve the results we get from it is to mimic human organization, politics, and sociology in a new layer on top of LLMs to surpass current bottlenecks, just as those were used to evolve human societies.


You can have an agent search the web for documentation and then provide it to the LLM. That's why Context7 is currently very popular with the AI user crowd.

  • I used o4 to generate NixOS config files from the pasted module source files. At first it produced outdated config stuff, but with context files it worked very well.

  • Kagi Assistant can do this too, but I find it's mostly useful because the traditional search function can find the pages the LLM loaded into its context before it started to output bullshit.

    It's nice when the LLM outputs bullshit, which is frequent.

Seriously, Cursor (using Claude 3.5) does this all the time. It ends up with a pile of junk because it introduces errors while fixing something, then goes in a loop trying to fix the errors it created and slaps more garbage on top of those.

Because it’s directly editing code in the IDE, instead of me transferring sections of code from a chat window, the large amount of bad code it writes is much more apparent.

I wonder if LLMs have been seen claiming “THERE’S A BUG IN THE COMPILER!”

A stage every developer goes through early in their development.

  • Gemini 2.5 got into as close to a heated argument with me as possible about the existence of a function in the Kotlin coroutines library that was never part of the library (though it does exist in a five-year-old PR, still visible on GitHub, that was never merged).

    It initially suggested I use the function as part of a solution, claiming it was part of the base library and could be imported as such. When I told it the function didn't exist in the library, it got obstinate and argued back and forth with me, to the point where it said it couldn't help me with that issue anymore but would love to help me with other things. It was surprisingly insistent that I must be importing the wrong library version or doing something else wrong.

    When I got rid of that chat's context and asked it about the existence of that function more directly, without the LLM first suggesting its use to me, it replied correctly that the function doesn't exist in the library but that the concept is easy to implement... the joys(?) of using an LLM and having it go in wildly different directions depending on the starting point.

    I'm used to the opposite situation, where an LLM will slide into sycophantic, agreeable hallucinations, so it was in a way kind of refreshing for Gemini not to do this. On the other hand, for it to be so confidently and provably wrong (while also standing its ground on its wrongness) got me unreasonably pissed off at it in a way that I don't experience when an LLM is wrong in the other direction.

    • That we get either sycophantic or intransigent hallucinations points to two fundamental limitations: there's no getting rid of hallucinations, and there's a trade-off in observed agreement "behavior".

      Also, the recurring theme of "just wipe out context and re-start" places a hard ceiling on how complex an issue the LLM can be useful for.

> It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.

The question is whether you'd rather find out it got stuck in a loop after 3 minutes with a coding agent or after 40 minutes of copy-pasting. The agent can also get out of loops more often because it can use tools to look up definitions with grep, ctags, or language-server tools; you could copy-paste commands for that too, but it would be much slower.
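As a rough illustration of the kind of definition lookup meant here -- a plain-Python stand-in for shelling out to grep or ctags (my own toy sketch, not any agent's actual implementation):

```python
import re
from pathlib import Path

def find_definitions(root, symbol):
    """Crude stand-in for an agent's lookup tool: scan a source tree
    for lines that look like a Python definition of `symbol`.
    (Real agents shell out to grep/ctags or query a language server.)"""
    pattern = re.compile(rf"^\s*(def|class)\s+{re.escape(symbol)}\b")
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if pattern.match(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

A tool call like this beats pasting "here's the file, the function is somewhere in it" into a chat window every iteration.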