Comment by aakresearch
10 days ago
To my understanding, an LLM, by design, cannot encode negation semantics: neither a negation "operation" nor any other "subtractive" operation is computable in LLM machinery.

Thinking out loud: in your example, "Read code" and "Form hypothesis" seem to be useful instructions for what you want, while "Do not write any code" and "Do not fix the bug" might actually mislead the model. Intuitively (in human terms), one would imagine that, given such an "instruction", the LLM would be repelled from the latent-space region associated with "write any code" or "fix the bug". But in reality an LLM cannot be "repelled"; it can only be attracted, in this case to the region associated with the full, negated "DO NOT <xxxx>". And that region probably either overlaps significantly with the former ("DO <xxx>") or includes it wholesale.

This may explain why such instructions sometimes seem to "work" as intended, albeit accidentally. My 2c.
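A toy way to see the overlap intuition (this is emphatically not an actual LLM latent space, just a bag-of-words stand-in I made up for illustration): a negated instruction shares nearly all of its surface tokens with the affirmative one, so any representation built largely from those tokens will place the two close together.

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": raw token counts. Real LLM representations are
    # contextual, but the surface overlap between "X" and "do not X"
    # is what this sketch is meant to show.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

pos = embed("write any code to fix the bug")
neg = embed("do not write any code to fix the bug")
print(round(cosine(pos, neg), 2))  # high similarity despite opposite intent
```

Under this crude measure the two instructions come out nearly identical, which is at least consistent with the "overlap, not repulsion" picture above.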