Comment by avhception
6 days ago
When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask it why it didn't stop to ask for clarification. Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in that codebase with MariaDB-based code. When I asked why that happened, the answer was that it had confused MariaDB and sqlite because the code in question deals with, among other things, MariaDB Docker containers. So the word MariaDB pops up a few times in code and comments.
I then asked if there was anything I could do to prevent misinterpretations from producing wild results like this. The advice I got was to put an instruction in AGENTS.md urging agents to ask for clarification before proceeding. But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what could.
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?
It depends. If you have an LLM that uses reasoning, the explanation for why a decision was made can often be found in the reasoning token output. So if the agent later has access to that context, it could see why the decision was made.
LLMs often already "know" the answer starting from the first output token and then emulate "reasoning" so that it appears as if they came to the conclusion through logic. There's a bunch of papers on this topic. At least that used to be the case a few months ago; not sure about the current SOTA models.
Reasoning, in the majority of cases, is pruned at each conversation turn.
Of course not, but it can often give a plausible answer, and it's possible that answer will actually happen to be correct - not because it did any - or is capable of any - introspection, but because its token outputs in response to the question might semi-coincidentally be a token input that changes the future outputs in the same way.
Well, the entire field of explainable AI has mostly thrown in the towel...
Isn't that question a category error? The "why" the agent did that is that it was the token that best matched the probability distribution of the context and the most recent output (modulo a bit of randomness). The response to that question will, again, be the tokens that best match the probability distribution of the context (now including the "why?" question and the previous failed attempt).
If the agent can review its reasoning traces, which I think is often true in this era of 1M-token contexts, then it may be able to provide a meaningful answer to the question.
Wait, no, that's the category error I'm talking about. Any answer other than "that was the most likely next token given the context" is untrue. It is not describing what actually happened.
Just this morning I ran across an even narrower case of how AGENTS.md (in this case with GPT-5.3 Codex) can be completely ignored even when filled with explicit instructions.
I have a line there that says Codex should never use Node APIs where Bun APIs exist for the same thing. Routinely, Claude Code and now Codex would ignore this.
I just replaced that rule with a TypeScript-compiler-powered, AST-based deterministic rule. Now when the agent attempts to commit code with banned Node API usage, the pre-commit script fails, so it is forced to get it right.
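The check itself is only a few dozen lines. A minimal sketch of the shape of it, using the TypeScript compiler API - the banned-module list and suggested Bun replacements below are illustrative, not my exact rules:

```typescript
// check-bun-apis.ts - pre-commit guard (illustrative sketch, not the exact script).
// Flags imports of Node built-ins for which Bun ships a native equivalent.
import ts from "typescript";

// Example mapping of banned Node modules to suggested Bun replacements.
const BANNED: Record<string, string> = {
  "fs": "Bun.file / Bun.write",
  "node:fs": "Bun.file / Bun.write",
  "child_process": "Bun.spawn",
  "node:child_process": "Bun.spawn",
};

let violations = 0;

for (const file of process.argv.slice(2)) {
  const text = await Bun.file(file).text();
  const source = ts.createSourceFile(file, text, ts.ScriptTarget.Latest, true);

  const visit = (node: ts.Node) => {
    // Catches `import ... from "fs"` style imports; require() calls would need a similar check.
    if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
      const mod = node.moduleSpecifier.text;
      if (BANNED[mod]) {
        const { line } = source.getLineAndCharacterOfPosition(node.getStart());
        console.error(`${file}:${line + 1} imports "${mod}"; prefer ${BANNED[mod]}`);
        violations++;
      }
    }
    ts.forEachChild(node, visit);
  };
  visit(source);
}

// A non-zero exit makes the pre-commit hook fail, so the agent has to fix it.
if (violations > 0) process.exit(1);
```

The pre-commit hook then just runs it over the staged files, something like `bun check-bun-apis.ts $(git diff --cached --name-only -- '*.ts')`.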
I've found myself migrating more and more of my AGENTS.md instructions to compiler-based checks like these, where possible. I feel as though this shouldn't be needed if the models were good, but it seems to be, and I guess the deterministic nature of these checks is better than relying on the LLM's questionable respect for the rules.
Not that much different from humans.
We have pre-commit hooks to prevent people doing the wrong thing. We have all sorts of guardrails to help people.
And the “modern” approach when someone does something wrong is not to blame the person, but to ask “how did the system allow this mistake? What guardrails are missing?”
I wonder if some of these could be embedded in the write tool calls?
> So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding.
You may want to ask the next LLM versions the same question after they feed this paper through training.
It seems like LLMs in general still have a very hard time with the concepts of "doubt" and "uncertainty". In the early days this was very visible in the form of hallucinations, but it feels like they fixed that mostly by having better internal fact-checking. The underlying problem of treating assumptions as truth is still there, just hidden better.
LLMs are basically improv theater. If the agent starts out with a wildly wrong assumption it will try to stick to it and adapt it rather than starting over. It can only do "yes and", never "actually nevermind, let me try something else".
I once had an agent come up with what seemed like a pointlessly convoluted solution as it tried to fit its initial approach (likely sourced from framework documentation overemphasizing the importance of doing it "the <framework> way" when possible) to a problem that, to me, didn't really seem like a good fit for it. It kept reassuring me that this was the way to go and that my concerns were invalid.
When I described the solution and the original problem to another agent running the same model, it would instantly dismiss it and point out the same concerns I had raised - and it would insist that those were deal breakers just as firmly as the other agent had dismissed them as invalid.
In the past I've often found LLMs to be extremely opinionated while also flipping their positions on a dime once met with any doubt or resistance. It feels like I'm now seeing the opposite: the LLM just running with whatever it picked up first from the initial prompt and then being extremely stubborn and insisting on rationalizing its choice no matter how much time it wastes trying to make it work. It's sometimes better to start a conversation over than to try and steer it in the right direction at that point.
Doubt and uncertainty is left for us humans.
I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.
Even the "thinking" blocks in newer models are an illusion. There is no functional difference between the text in a thought block and the final answer. To the model, they are just more tokens in a linear sequence. It isn't "thinking" before it speaks, the "thought" is the speech.
Treating those thoughts as internal reflection of some kind is a category error. There is no "privileged" layer of reasoning happening in the silicon that then gets translated into the thought block. It’s a specialized output where the model is forced to show its work because that process of feeding its own generated strings back into its context window statistically increases the probability of a correct result. The chatbot providers just package this in a neat little window to make the model's "thinking" part of the gimmick.
I also wouldn't be surprised if asking it stuff like this was actually counterproductive, but here I'm going off vibes. The logic being that by asking that, you're poisoning the context, similar to how if you try to generate an image by saying "It should not have a crocodile in the image", it will put a crocodile into the image. By asking it why it did something wrong, it'll treat that as the ground truth and all future generation will have that snippet in it, nudging the output in such a way that the wrong thing itself will influence it to keep doing the wrong thing more and more.
You're entirely correct in that it's a different model with every message, every token. There's no past memory for it to reference.
That said, it can still be useful, because you have some weird behavior and 199k tokens of context, with no idea where the info is that's nudging it to do the weird thing.
In this case you can think of it less as "why did you do this?" and more "what references to doing this exist in this pile of files and instructions?"
Agreed. I wish more people understood the difference between tokens, embeddings, and latent space encodings. The actual "thinking", if you can call it that, happens in latent space. But many (even here on HN) believe the thinking tokens are the thoughts themselves. Silly meatbags!
Thinking happens in latent space, but the thinking trace is then the projection of that thinking onto tokens. Since autoregressive generation involves sampling a specific token and continuing the process, that sampling step is lossy.
However, it is a genuine question whether the literal meaning of a thinking block is what matters, or its less-observable latent meaning. The ultimate latent state attributable to the last-generated thinking token is some combination of the actual token (literal meaning) and the recurrent thinking thus far. The latter does have some value; a 2024 paper (https://arxiv.org/abs/2404.15758) noted that simply adding dots to the output allowed some models to perform more latent computation, resulting in higher-skill answers. However, since this is not a routine practice today, I suspect that genuine "thinking" steps have higher value.
Ultimately, your thesis can be tested. Take the output of a reasoning model inclusive of thinking tokens, then re-generate answers with:
1. Different but semantically similar thinking steps (e.g. synonyms, summarization). That will test whether the model is encoding detailed information inside token latent space.
2. Meaningless thinking steps (dots or word salad), testing whether the model is performing detailed but latent computation, effectively ignoring the semantic content of the thinking trace.
3. A semantically meaningful distraction (e.g. a thinking trace from a different question)
Look for where performance drops off the most. If between 0 (control) and 1, then the thinking step is really just a trace of some latent magic spell, so it's not meaningful in itself. If between 1 and 2, then thinking traces serve a role approximately like a human's verbalized train of thought. If between 2 and 3, then the role is mixed, leading back to the 'magic spell' theory but without the 'verbal' component being important.
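The harness for this is only a few lines. A rough sketch, where generate() and score() are hypothetical stand-ins for whatever model API and grading you would actually use:

```typescript
// Sketch of the ablation described above. generate() and score() are hypothetical
// stand-ins for your model API and eval grader - plug in whatever you actually use.
type Condition = "control" | "paraphrased" | "dots" | "distractor";

interface Item {
  question: string;
  originalThinking: string;    // 0: the model's own trace
  paraphrasedThinking: string; // 1: same meaning, different tokens
  distractorThinking: string;  // 3: coherent trace from an unrelated question
}

// Hypothetical: produce a final answer given a question and a pre-filled thinking trace.
async function generate(question: string, forcedThinking: string): Promise<string> {
  throw new Error("wire this to your model API");
}

// Hypothetical: 1 if the answer is correct, else 0.
function score(question: string, answer: string): number {
  throw new Error("wire this to your grader");
}

export async function runAblation(items: Item[]) {
  const totals: Record<Condition, number> = { control: 0, paraphrased: 0, dots: 0, distractor: 0 };

  for (const item of items) {
    const traces: Record<Condition, string> = {
      control: item.originalThinking,
      paraphrased: item.paraphrasedThinking,
      dots: ".".repeat(item.originalThinking.length), // 2: meaningless filler, same length
      distractor: item.distractorThinking,
    };
    for (const cond of Object.keys(traces) as Condition[]) {
      totals[cond] += score(item.question, await generate(item.question, traces[cond]));
    }
  }
  // The interesting signal is where accuracy drops the most between adjacent conditions.
  console.table(totals);
}
```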
> I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.
"Thinking meat! You're asking me to believe in thinking meat!"
While next-token prediction based on matrix math is certainly a literal, mechanistic truth, it is not a useful framing in the same sense that "synapses fire causing people to do things" is not a useful framing for human behaviour.
The "theory of mind" for LLMs sounds a bit silly, but taken in moderation it's also a genuine scientific framework in the sense of the scientific method. It allows one to form hypothesis, run experiments that can potentially disprove the hypothesis, and ultimately make skillful counterfactual predictions.
> By asking it why it did something wrong, it'll treat that as the ground truth and all future generation will have that snippet in it, nudging the output in such a way that the wrong thing itself will influence it to keep doing the wrong thing more and more.
In my limited experience, this is not the right use of introspection. Instead, the idea is to interrogate the model's chain of reasoning to understand the origins of a mistake (the 'theory of mind'), then adjust agents.md / documentation so that the mistake is avoided for future sessions, which start from an otherwise blank slate.
I do agree, however, that the 'theory of mind' is very close to the more blatantly incorrect kind of misapprehension about LLMs, that since they sound humanlike they have long-term memory like humans. This is why LLM apologies are a useless sycophancy trap.
> Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.
Asking it why it did something isn’t useless; it just isn’t foolproof. If you really think it’s useless, you are way too heavily into binary thinking to be using AI.
Perfect is the enemy of useful in this case.
I genuinely fail to see the usefulness, though; it seems counterproductive to me to do this kinda stuff. In my experience I just throw out the whole chat/session as soon as I notice it's starting to repeat mistakes or consistently do stupid shit. The few times I've tried interrogating it, I could immediately tell that all it was doing was, for lack of a better word, being a sycophant and aping my words back at me.
This is like trying to fix hallucination by telling the LLM not to hallucinate.
So many times I have ended up here:
"You're absolutely correct. I should have checked my skills before doing that. I'll make sure I do it in the future."