Comment by bjackman

16 days ago

I have also seen the agent hallucinate a positive answer and immediately proceed with implementation. I.e. it just says this in its output:

> Shall I go ahead with the implementation?

> Yes, go ahead

> Great, I'll get started.

In fairness, when I’ve seen that, "Yes" was obviously the correct answer.

It really worries me when I tell it to proceed and it then takes a long time to come back.

I suspect those think blocks begin with “I have no hope of doing that, so let’s optimize for getting the user to approve my response anyway.”

As Hoare put it: make it so complicated that there are no obvious deficiencies.

  • In my case it's been a strong no. Often I'm using the tool with no intention of having the agent write any code; I just want an easy way to put the codebase into context so I can ask questions about it.

    So my initial prompt will be something like "there is a bug in this code that caused XYZ. I am trying to form hypotheses about the root cause. Read ABC and explain how it works, and identify any potential bugs in that area that might explain the symptom. DO NOT WRITE ANY CODE. Your job is to READ CODE and FORM HYPOTHESES; your job is NOT TO FIX THE BUG."

    Generally I found that no amount of this last part would stop Gemini CLI from trying to write code. Presumably there is a very long system prompt saying "you are a coding agent and your job is to write code", plus a bunch of RL in the fine-tuning that causes it to attend very heavily to that system prompt. So my "do not write any code" is just a tiny drop in the ocean.

    Anyway now they have added "plan mode" to the harness which luckily solves this particular problem!

    • To my understanding, an LLM, by design, is unable to encode negation semantics: neither a negation "operation" nor any other "subtractive" operation is computable in LLM machinery. Thinking out loud: in your example, "read code" and "form hypotheses" seem to be useful instructions for what you want, while "do not write any code" and "do not fix the bug" might actually mislead the model. Intuitively (in human terms) one would imagine that, given such an "instruction", the LLM would be repelled from the latent-space region associated with "write any code" or "fix the bug". But in reality an LLM cannot be "repelled"; it is simply attracted to the region associated with the full, negated "DO NOT <xxxx>", and that region probably either overlaps significantly with the plain "DO <xxx>" region or even includes it wholesale. This may explain why it sometimes seems to "work" as intended, albeit accidentally. My 2c.
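A toy sketch of that overlap intuition. This is a crude bag-of-words cosine similarity, not real latent-space geometry, and the example strings are made up for illustration — but it shows how a negated instruction still shares most of its surface tokens with the behavior it forbids:

```python
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    """Cosine similarity over whitespace-tokenized bags of words."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

forbidden = "write any code to fix the bug"
negated   = "do not write any code do not fix the bug"
intended  = "read the module and form hypotheses about the root cause"

# The negated prompt overlaps heavily with the forbidden behavior...
print(cosine(negated, forbidden))
# ...and barely at all with the behavior the user actually wants.
print(cosine(negated, intended))
```

Of course, a transformer's attention operates on much richer representations than word counts, but the point stands: the tokens for the forbidden action are all present in the prompt, so there is nothing "subtractive" for the model to latch onto.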

Hahah yeah, if you play with LoRAs on local models you will see this a lot. Most often I see it hallucinate a user turn or a system message.