Comment by aakresearch

8 days ago

I am by no means an expert, but I'd like to offer my mental model - up to you to decide if it is solid or not, but it works for me.

I think the core intuition is this: like any other "rasterized" system with finite memory, an LLM cannot encode the absence of something - a relation, a concept, an entity - in its internal weights. Say, you can have "Product" or "Order" tables in your database, but you cannot have "NotAProduct" or "NotAnOrder" tables - for the obvious reason that such relations are infinite and uncountable. So, to establish the absence of a Product or an Order, your application must execute a "search" operation through the relevant tables.

But in LLM-space, a "search" operation does not exist. It is mathematically undefined. An LLM arrives at its output (or "what to do") through a sequence of projections of the input token vectors through its "latent space". It "moves toward" high-probability clusters, fundamentally unable to "move away". So the success of any "negation" in the prompt ("don't touch this file", "draw me a ballot box without a flag on it") depends on how heavily such a scenario is represented in the training data/model space. And again, the absence of something may be hard or impossible to usefully encode, especially if the "something" is not fixed. Therefore, to expect the sentence "don't touch this file" to result in, well, not touching the file is pure gambling. Sometimes it may look like it works, albeit for the wrong reasons, and other times the LLM may do exactly the opposite - because its weight matrix statistically pushes it towards "touch this file", completely ignoring the (nonexistent in its latent space) "don't".
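To make the database half of that analogy concrete, here is a minimal sketch (the `products` table and the names in it are made up for illustration): absence is never stored as a row anywhere; it can only be *computed* by searching over what does exist - the operation the forward pass has no equivalent of.

```python
import sqlite3

# A toy schema: the database can record what IS a product,
# but there is no way to store a "NotAProduct" row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO products (name) VALUES ('widget')")

def is_product(name: str) -> bool:
    # Absence is not encoded anywhere; it is computed on demand
    # by a search over the rows that do exist.
    row = con.execute(
        "SELECT 1 FROM products WHERE name = ? LIMIT 1", (name,)
    ).fetchone()
    return row is not None

print(is_product("widget"))  # True  - found by search
print(is_product("gadget"))  # False - absence established only via search
```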

There is no way to reliably know what will work, and no "skill" or "art" in this. Well, no more than in rolling dice or casting horoscopes.

I'd like to add that, for the above reason, I find the usefulness of "agentic development" on par with reading avian remains. But when I explored it, two practical pieces of advice seemed helpful in nudging the LLM around the negation problem:

- Omit the "don't" prompt completely, thus not creating a false "attractor" for the LLM; and

- Provide an alternate positive directive ("what to DO", not "what to NOT DO") to act as an "escape hatch" for when the LLM might "want" to touch the sacred file or drop the production DB (a sketch follows this list).
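For illustration only - "config.prod.yaml" and "src/logging/" below are hypothetical names, not from any real project - here is roughly how those two tips change a prompt:

```python
# Negated prompt: mentions the forbidden file, creating the very
# "attractor" the first tip says to avoid.
risky_prompt = "Refactor the logging module. Don't touch config.prod.yaml."

# Rewritten: the negation is omitted entirely, and a positive directive
# scopes the work and gives the model an "escape hatch" action.
safer_prompt = (
    "Refactor the logging module. Make edits only under src/logging/. "
    "If a change appears to require a file outside that directory, stop "
    "and describe the proposed change instead of making it."
)
```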

While this looked like it somewhat worked, I think it is trivially obvious that trying to predict all the nonsense an LLM might want to perform, and coming up with possible "escape hatches" for everything, very quickly becomes utterly impractical.