Comment by Benjammer

2 days ago

It's nice to see a paper that confirms what anyone who has practiced using LLM tools already knows very well, heuristically. Keeping your context clean matters, "conversations" are only a construct of product interfaces, they hurt the quality of responses from the LLM itself, and once your context is "poisoned" it will not recover, you need to start fresh with a new chat.

My experiences somewhat confirm these observations, but I also had one that was different. Two weeks of debugging IPSEC issues with Gemini. Initially, I imported all the IPSEC documentation from OPNsense and pfSense into Gemini and informed it of the general context in which I was operating (in reference to 'keeping your context clean'). Then I added my initial settings for both sides (sensitive information redacted!). Afterwards, I entered a long feedback loop, posting logs and asking and answering questions.

At the end of the two weeks, I observed that: The LLM was much less likely to become distracted. Sometimes, I would dump whole forum threads or SO posts into it, when it said "this is not what we are seeing here, because of [earlier context or finding]. I eliminated all dead ends logically and informed it of this (yes, it can help with the reflection, but I had to make the decisions). In the end, I found the cause of my issues.

This somewhat confirms what some user here on HN said a few days ago. LLMs are good at compressing complex information into simple one, but not at expanding simple ideas into complex ones. As long as my input was larger than the output (either complexity or length), I was happy with the results.

I could have done this without the LLM. However, it was helpful in that it stored facts from the outset that I had either forgotten or been unable to retrieve quickly in new contexts. It also made it easier to identify time patterns in large log files, which helped me debug my site-to-site connection. I also optimized many other settings along the way, resolving not only the most problematic issue. This meant, in addition to fixing my problem, I learned quite a bit. The 'state' was only occasionally incorrect about my current parameter settings, but this was always easy to correct. This confirms what others already saw: If you know where you are going and treat it as a tool, it is helpful. However, don't try to offload decisions or let it direct you in the wrong direction.

Overall, 350k Tokens used (about 300k words). Here's a related blog post [1] with my overall path, but not directly corresponding to this specific issue. (please don't recommend wireguard; I am aware of it)

    [1]: https://du.nkel.dev/blog/2021-11-19_pfsense_opnsense_ipsec_cgnat/

  • Recently, Gemini helped me fix a bug in a PPP driver (Zephyr OS) without prior knowledge of PPP or even driver development really. I would copy-paste logs of raw PPP frames in HEX and it would just decode everything and explain the meaning of each bytes. In about an hour, I knew enough about PPP to fix the bug and submit a patch.

    https://g.co/gemini/share/7edf8fa373fe

    • Or you could just read the PPP RFC [0].

      I’m not saying that your approach is wrong. But most LLM workflows are either brute forcing the solution, or seeking a local minima to be stuck in. It’s like doing thousands of experiments of objects falling to figure out gravity while there’s a physics textbooks nearby.

      [0]: https://datatracker.ietf.org/doc/html/rfc1661

      8 replies →

    • Interesting that it works for you. I tried several times something similar with frames from a 5G network and it mixed fields from 4G and 5G in its answers (or even from non-cellular network protocols because they had similar features as the 5G protocol I was looking at). Occasionally, the explanation was completely invented or based on discussions of planned features for future versions.

      I have really learned to mistrust and double check every single line those systems produce. Same for writing code. Everything they produce looks nice and reasonable on the surface but when you dig deaper it falls apart unless it's something very very basic.

      1 reply →

  • That's some impressive prompt engineering skills to keep it on track for that long, nice work! I'll have to try out some longer-form chats with Gemini and see what I get.

    I totally agree that LLMs are great at compressing information; I've set up the docs feature in Cursor to index several entire large documentation websites for major libraries and it's able to distill relevant information very quickly.

    • In Gemini, it is really good to have large window with 1M tokens. However, around 100,000 it starts to make mistakes and refactor its own code.

      Sometimes it is good to start new chat or switch to Claude.

      And it really helps to be very precise with wording of specification what you want to achieve. Or repeat it sometimes with some added request lines.

      GIGO in reality :)

      7 replies →

This matches my experience exactly. "poisoned" is a great way to put it. I find once something has gone wrong all subsequent responses are bad. This is why I am iffy on ChatGPT's memory features. I don't notice it causing any huge problems but I don't love how it pollutes my context in ways I don't fully understand.

  • It's interesting how much the nature of LLMs fundamentally being self recursive next token predictors aligns with the Chinese Room experiment. [1] In such experiment it also makes perfect sense that a single wrong response would cascade into a series of subsequent ever more drifting errors. I think it all emphasizes the relevance of the otherwise unqualifiable concept of 'understanding.'

    In many ways this issue could make the Chinese Room thought experiment even more compelling. Because it's a very practical and inescapable issue.

    [1] - https://en.wikipedia.org/wiki/Chinese_room

    • I don't think the Chinese room thought experiment is about this, or performance of LLMs in general. Searle explicitly argues that a program can't induce "understanding" even if it mimicked human understanding perfectly because programs don't have "causal powers" to generate "mental states".

      This is mentioned in the Wikipedia page too: "Although its proponents originally presented the argument in reaction to statements of artificial intelligence (AI) researchers, it is not an argument against the goals of mainstream AI research because it does not show a limit in the amount of intelligent behavior a machine can display."

    • Great comment on the Chinese room. That idea seems to be dismissed nowadays but the concept of “cascading failure to understand context” is absolutely relevant to LLMs. I often find myself needing to explain basic details over and over again to an LLM; when with a person it would be a five second, “no, I mean like this way, not that way” explanation.

  • I find using tools like LMStudio, which lets you edit your chat history on the fly, really helps deal with this problem. The models you can host locally are much weaker, but they perform a little better than the really big models once you need to factor in these poisoning problems.

    A nice middle-ground I'm finding is to ask Claude an initial conversation starter in its "thinking" mode, and then copy/paste that conversation into LMStudio and have a weaker model like Gemma pick-up from where Claude left off.

  • I have very limited experience with llms but i've always thought of it as a compounding errors problem, once you get a small error early on it can compound and go completely off track later.

  • good point on the memory feature. Wow that sounds terrible

    • The memory is easy to turn off. It sounded like a very bad idea to cross-contaminate chats so I disabled it as soon as ChatGPT introduced it.

I've been saying for ages that I want to be able to fork conversations so I can experiment with the direction an exchange takes without irrevocably poisoning a promising well. I can't do this with ChatGPT, is anyone aware of a provider that offers this as a feature?

  • Google AI studio, ChatGPT and Claude all support this. Google AI studio is the only one that let's you branch to a separate chat though. For ChatGPT and claude you just edit the message you want to branch from.

    • Feels like a semi-simple UX fix could make this a lot more natural. Git-style forks but for chats.

    • Support: Yes. But the UX is not optimized for this.

      Imagine trying to find a specific output/input that was good in the conversation tree.

      1 reply →

  • I once built something like this for fun as a side project.

    You can highlight some text in a chat and fork the chat to talk about that text selection, so the LLM has context of that along with the previous chat history and it responds in a new chat (entire chat history up to that point from the parent chat gets copied over - basically inspired by the Unix `fork`).

    Your text selection from the parent chat would get turned into a hyperlink to the new child chat so you can always get to it again if you're reading the parent chat.

  • T3.chat supports convo forking and in my experience works really well.

    The fundamental issue is that LLMs do not currently have real long term memory, and until they do, this is about the best we can do.

  • I need to think about this a bit more, but I think I would love a thread feature in ChatGPT, so that it has the context up to the point of creation but doesn’t affect the main conversation. It would help in two ways, it keeps the main topic from getting poisoned , and allow me to minimise text clutter when i go off on tangents during the conversation.

  • On Openrouter you can delete previous answers (and questions) and maintain a separate conversation with different models.

    But it would indeed be nice to either disable answers (without deleting them) or forking a conversation. It wouldn't be hard to implement; I wonder if there's a market for just this?

  • If you're happy running local models, llama.cpp's built-in web-server's interface can do this.

  • Some 3rd party UIs offer this, I use typingmind sometimes that does but AFAIK some open source ones do too.

The #1 tip I teach is to make extensive use of the teeny-tiny mostly hidden “edit” button in ChatGPT and Claude. When you get a bad response, stop and edit to get a better one, rather than letting crap start to multiply crap.

  • Hear hear! Basically if the first reply isn't good/didnt understand/got something wrong, restart from the beginning with a better prompt, explaining more/better. Rinse and repeat.

    • You can do even better by asking it to ask clarifying questions before generating anything, then editing your initial prompt with those clarifications.

  • It is also a great way to branch conversations from some shared “initial context”.

    They really need to make that edit feature much more prominent. It is such an important way to interact with the model.

An interesting little example of this problem is initial prompting, which is effectively just a permanent, hidden context that can't be cleared. On Twitter right now, the "Grok" bot has recently begun frequently mentioning "White Genocide," which is, y'know, odd. This is almost certainly because someone recently adjusted its prompt to tell it what its views on white genocide are meant to be, which for a perfect chatbot wouldn't matter when you ask it about other topics, but it DOES matter. It's part of the context. It's gonna talk about that now.

  • > This is almost certainly because someone recently adjusted its prompt to tell it what its views on white genocide are meant to be

    Well, someone did something to it; whether it was training, feature boosting the way Golden Gate Claude [0] was done, adjusting the system prompt, or assuring that it's internet search for contextual information would always return material about that, or some combination of those, is neither obvious nor, if someone had a conjecture as to which one or combination it was, easily falsifiable/verifiable.

    [0] https://www.anthropic.com/news/golden-gate-claude

    • Source [0]. The examples look pretty clearly like they stuck it in the context window, not trained it in. It consistently seems to structure the replies as though the user they're replying to is the one who brought up white genocide in South Africa, and it responds the way that LLMs often respond to such topics: saying that it's controversial and giving both perspectives. That's not behavior I would expect if they had done the Golden Gate Claude method, which inserted the Golden Gate Bridge a bit more fluidly into the conversation rather than seeming to address a phantom sentence that the user supposedly said.

      Also, let's be honest, in a Musk company they're going to have taking the shortest possible route to accomplishing what he wanted them to.

      [0] https://www.cnn.com/2025/05/14/business/grok-ai-chatbot-repl...

      1 reply →

  • Well, telling an AI chatbot to insist on discussing a white genocide seems like a perfectly Elon thing to do!

  • > This is almost certainly because someone recently adjusted its prompt to tell it what its views on white genocide are

    Do you have any source on this? System prompts get leaked/extracted all the time so imagine someone would notice this

    Edit: just realized you’re talking about the Grok bot, not Grok the LLM available on X or grok.com. With the bot it’s probably harder to extract its exact instructions since it only replies via tweets. For reference here’s the current Grok the LLM system prompt: https://github.com/asgeirtj/system_prompts_leaks/blob/main/g...

  • Probably because it is now learning from a lot of videos posted on X by misc right-wingers showing rallying cries of South African politicians like Julius Malema, Paul Mashatile etc. Not very odd.

    As merely 3 of over a dozen examples:

    https://x.com/DefiantLs/status/1922213073957327219

    https://x.com/PPC4Liberty/status/1922650016579018855

    https://x.com/News24/status/1920909178236776755

Has any interface implemented a .. history cleaning mechanism? Ie with every chat message focus on cleaning up dead ends in the conversation or irrelevant details. Like summation but organic for the topic at hand?

Most history would remain, it wouldn’t try to summarize exactly, just prune and organize the history relative to the conversation path?

  • I've had success having a conversation about requirements, asking the model to summarize the requirements as a spec to feed into a model for implementation, then pass that spec into a fresh context. Haven't seen any UI to do this automatically but fairly trivial/natural to perform with existing tools.

    • Doing the same. Though I wish there was some kind of optimization of text generated by an LLM for an LLM. Just mentioning it’s for an LLM instead of Juan consumption yields no observably different results.

  • "Every problem in computer science can be solved with another level of indirection."

    One could argue that the attention mechanism in transformers is already designed to do that.

    But you need to train it more specifically with that in mind if you want it to be better at damping attention to parts that are deemed irrelevant by the subsequent evolution of the conversation.

    And that requires the black art of ML training.

    While thinking of doing this as a hack on top of the chat product feels more like engineering and we're more familiar with that as a field.

  • the problem is that it needs to read the log to prune the log, and so if there is garbage in the log, which needs to be pruned to keep from poisoning the main chat, then the garbage will poison the pruning model, and it will do a bad job pruning.

  • I mean, you could build this, but it would just be a feature on top of a product abstraction of a "conversation".

    Each time you press enter, you are spinning up a new instance of the LLM and passing in the entire previous chat text plus your new message, and asking it to predict the next tokens. It does this iteratively until the model produces a <stop> token, and then it returns the text to you and the PRODUCT parses it back into separate chat messages and displays it in your UI.

    What you are asking the PRODUCT to now do is to edit your and its chat messages in the history of the chat, and then send that as the new history with your latest message. This is the only way to clean the context because the context is nothing more than your messages and its previous responses, plus anything that tools have pulled in. I think it would be sort of a weird feature to add to a chat bot to have the chat bot, each time you send a new message, go back through the entire history of your chat and just start editing the messages to prune out details. You would scroll up and see a different conversation, it would be confusing.

    IMO, this is just part of prompt engineering skills to keep your context clean or know how to "clean" it by branching/summarizing conversations.

  • Not a history cleaning mechanism, but related to that, Cursor in the most recent release introduced a feature to duplicate your chat (so you can saveguard yourself against poisoning and go back to and unpoisoned point in history), which seems like an addmision of the same problem.

  • Isn't this what Claude workbench in the Anthropic console does? It lets the user edit both sides of the conversation history.

Agreed poisoned is a good term. I’d like to see “version control” for conversations via the API and UI that lets you rollback to a previous place or clone from that spot into a new conversation. Even a typo or having to clarify a previous message skews the probabilities of future responses due to the accident.

  • "Forking" or "branching" (probably better received outside of SWEs) a conversation really ought to be a first class feature of ChatGPT et Al.

    • It is in Google Gemini, which I really hate to say - but I've been using a lot more than GPT. I reckon I'll be cancelling my Pro if Gemini stays with this lead for my everyday workflows.

      5 replies →

    • This was part of ChatGPT from pretty much the beginning, maybe not the initial version but few weeks later- don't recall exactly

    • It is!

      It exists in Claude as a true branch - you can see the old threads - and in ChatGPT as without the history.

      Edit a previous reply and hit “go” to see it in action.

  • This exists in Claude. Edit any previous message and it will fork the conversation.

I mostly just use LLMs for autocomplete (not chat), but wouldn’t this be fixed by adding a “delete message” button/context option in LLM chat UIs?

If you delete the last message from the LLM (so now, you sent the last message), it would then generate a new response. (This would be particularly useful with high-temperature/more “randomly” configured LLMs.)

If you delete any other message, it just updates the LLM context for any future responses it sends (the real problem at hand, context cleanup).

I think seeing it work this way would also really help end users who think LLMs are “intelligent” to better understand that it’s just a big, complex autocomplete (and that’s still very useful).

Maybe this is standard already, or used in some LLM UI? If not, consider this comment as putting it in the public domain.

Now that I’m thinking about it, it seems like it might be practical to use “sub-contextual LLMs” to manage the context of your main LLM chat. Basically, if an LLM response in your chat/context is very long, you could ask the “sub-contextual LLM” to shorten/summarize that response, thus trimming down/cleaning the context for your overall conversation. (Also, more simply, an “edit message” button could do the same, just with you, the human, editing the context instead of an LLM…)

  • This is how Claude’s UI used to work, in practice, where you could edit the context directly.

Weirdly it has gotten so far that I have embedded this into my workflow and will often prompt:

> "Good work so far, now I want to take it to another step (somewhat related but feeling it too hard): <short description>. Do you think we can do it in this conversation or is it better to start fresh? If so, prepare an initial prompt for your next fresh instantiation."

Sometimes the model says that it might be better to start fresh, and prepares a good summary prompt (including a final 'see you later'), whereas in other cases it assures me it can continue.

I have a lot of notebooks with "initial prompts to explore forward". But given the sycophancy going on as well as one-step RL (sigh) post-training [1], it indeed seems AI platforms would like to keep the conversation going.

[1] RL in post-training has little to do with real RL and just uses one shot preference mechanisms with an RL inspired training loop. There is very little work in terms of long-term preferences slash conversations, as that would increase requirements exponentially.

  • Is there any reason to think that LLMs have the introspection ability to be able to answer your question effectively? I just default to having them provide a summary that I can use to start the next conversation, because I’m unclear on how an LLM would know it’s losing the plot due to long context window.

I agree—once the context is "poisoned," it’s tough to recover. A potential improvement could be having the LLM periodically clean or reset certain parts of the context without starting from scratch. However, the challenge would be determining which parts of the context need resetting without losing essential information. Smarter context management could help maintain coherence in longer conversations, but it’s a tricky balance to strike.Perhaps using another agent to do the job?

I suppose that the chain-of-thought style of prompting that is used by AI chat applications internally also breaks down because of this phenomenon.

>"conversations" are only a construct of product interfaces

This seems to be in flux now due to RL training on multiturn eval datasets so while the context window is evergreen every time, there will be some bias towards interpreting each prompt as part of a longer conversation. Mutliturn post training is not scaled out yet in public but I think it may be the way to keep on the 'double time spent on goal every 7 months curve'

Yes even when coding and not conversing I often start new conversations where I take the current code and explain it new. This often gives better results than hammering on one conversation.

This feels like something that can be fixed with manual instructions which prompt the model to summarize and forget. This might even map appropriately to human psychology. Working Memory vs Narrative/Episodic Memory.

One of the most frustrating features of ChatGPT is “memories” which can cause that poisoning to follow you around between chats.

Which is why I really like zed's chat UX experience: being able to edit the full prior conversation like a text file, I can go back and clean it up, do small adjustments, delete turns etc and then continue the discussion with a cleaner and more relevant context.

I have made zed one of my main llm chat interfaces even for non-programming tasks, because being able to do that is great.

Yarp! And "poisoning" can be done with "off-topic" questions and answers as well as just sort of "dilution". Have noticed this when doing content generation repeatedly, tight instructions get diluted over time.

" 'conversations' are only a construct of product interface" is so helpful maintain top-of-mind, but difficult because of all the "conversational" cues

What surprised me is how early the models start locking into wrong assumptions

Happens with people too if you think about it.

  • Who gets lost in multi-turn conversations?

    • Everyone?

      How often in meetings does everyone maintain a running context of the entire conversation, instead of responding to the last thing that was said with a comment that has an outstanding chance of being forgotten as soon as the next person starts speaking?

      1 reply →

And now that chatgpt has a "memory" and can access previous conversations, it might be poisoned permanently. It gets one really bad idea, and forever after it insists on dumping that bad idea into every subsequent response ever after you repeatedly tell it "THAT'S A SHIT IDEA DON'T EVER MENTION THAT AGAIN". Sometimes it'll accidentally include some of its internal prompting, "user is very unhappy, make sure to not include xyz", and then it'll give you a response that is entirely focused around xyz.