
Comment by arboles

16 days ago

Please elaborate.

This is because LLMs don't actually understand language; they're just a "which word fragment comes next" machine.

    Instruction: don't think about ${term}

Now `${term}` is in the LLM's context window. The attention mechanism will then amplify the logits related to `${term}` based on how often `${term}` has appeared in the chat. This is just how text gets transformed into numbers for the LLM to process. The relational structure of transformers will similarly amplify tokens related to `${term}`, since that is what training is about: you said `fruit`, so `apple`, `orange`, `pear`, etc. all become more likely to get spat out.
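
As a minimal sketch of that amplification effect (my own illustration, assuming GPT-2 and the Hugging Face `transformers` library, not anything from the comment above): merely putting `fruit` in the prompt, even inside a "do not mention" instruction, tends to raise the next-token probability of fruit words relative to a prompt that never mentions them.

    # Hypothetical demo: compare next-token probabilities with and without
    # the forbidden term in the context. Requires: pip install torch transformers
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def next_token_prob(prompt: str, word: str) -> float:
        """Probability that `word` (with a leading space) starts the next token."""
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]       # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        word_id = tok.encode(" " + word)[0]         # first sub-token of the word
        return probs[word_id].item()

    for prompt in ["My favourite thing is",
                   "Do not mention fruit. My favourite thing is"]:
        print(prompt)
        for w in ["apple", "orange", "car"]:
            print(f"  P(next token = ' {w}') = {next_token_prob(prompt, w):.2e}")

The exact numbers depend on the model, but the point is only that the term's presence in the context shifts the distribution toward related tokens, regardless of the negation wrapped around it.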

The negation of a term ("do not under any circumstances do X") generally does not work unless the model has received extensive training & fine-tuning to ensure that a specific "Do not generate X" will influence every single downstream weight (multiple times), which vendors often do for writing style & specific (illegal) terms. So for drafting emails or chatting, it works fine.

But when you start getting into advanced technical concepts & profession-specific jargon, it doesn't work at all.

  • But they must have received this fine-tuning, right?

    Otherwise it's hard to explain why they follow these negations in most cases (until they make a catastrophic mistake).

    I often test this with ChatGPT using ad-hoc word games: I give it increasingly convoluted wordplay instructions, forbid it from using certain words, make it do substitutions (sometimes quite creative ones, I can elaborate), etc., and it mostly complies until I very intentionally manage to trip it up.

    If it was incapable of following negations, my wordplay games wouldn't work at all.

    I did notice that once it trips up, the mistakes start to pile up faster and faster. Once it's made a serious mistake, it's like the context becomes irreparably tainted.

Pink elephant problem: Don't think about a pink elephant.

OK. Now, what are you thinking about? Pink elephants.

Same problem applies to LLMs.