Comment by siva7
16 days ago
People should read a bit more about transformer architecture to better understand why telling the model what not to do is a bad idea.
I find myself wondering about this though. Because, yes, what you say is true. Transformer architecture isn’t likely to handle negations particularly well. And we saw this plain as day in early versions of ChatGPT, for example. But then all the big players pretty much “fixed” negations and I have no idea how. So is it still accurate to say that understanding the transformer architecture is particularly informative about modern capabilities?
They did not "fix" the negation problem. It's still there. Along with other drift/misinterpretation issues.
I'm not sure that advice is effective either.
I use an LLM as a learning tool. I'm not interested in it implementing things for me, so I always sidestep its seemingly frantic desire to write code by ignoring the request and prompting it along other lines. It will still enthusiastically burst into code.
LLMs do not have emotions, but they seem to be excessively insecure and overly eager to impress.
Please elaborate.
This is because LLMs don't actually understand language; they're just a "which word fragment comes next" machine.
Now `${term}` is in the LLM's context window. The attention mechanism will then amplify the logits related to `${term}` based on how often `${term}` appeared in the chat; this is just how text gets turned into numbers for the LLM to process. The relational structure of the transformer will similarly amplify tokens related to `${term}`, since that is what training is about: you said `fruit`, so `apple`, `orange`, `pear`, etc. all become more likely to get spat out.
The negation of a term ("do not under any circumstances do X") generally does not work unless the model has received extensive training and fine-tuning to ensure that a specific "do not generate X" influences every single downstream weight (multiple times), which vendors often do for writing style and specific (illegal) terms. So for drafting emails or chatting, it works fine.
But when you start getting into advanced technical concepts and profession-specific jargon, it does not work at all.
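To give a rough feel for what I mean, here's a toy numeric sketch. It is nothing like a real transformer (made-up embeddings, no attention heads, no training); it just shows how a content token in the context raises the scores of related vocabulary tokens, and how a generic negation token doesn't flip that boost unless the model has been specifically trained to make it do so.

```python
# Toy sketch, NOT a real model: made-up vectors, a crude stand-in for
# attention + unembedding. Illustrates that a content word in the context
# boosts related tokens, and a preceding "not" barely changes the ranking.
import numpy as np

vocab = ["apple", "orange", "pear", "car", "house"]

# Hypothetical embeddings: fruit-like tokens point in a similar direction;
# "not" mostly lives on its own axis and so contributes little either way.
emb = {
    "fruit":  np.array([1.0, 0.0, 0.0]),
    "not":    np.array([-0.1, 0.0, 1.0]),
    "apple":  np.array([0.9, 0.1, 0.0]),
    "orange": np.array([0.8, 0.2, 0.0]),
    "pear":   np.array([0.85, 0.15, 0.0]),
    "car":    np.array([0.1, 0.9, 0.0]),
    "house":  np.array([0.0, 1.0, 0.0]),
}

def next_token_probs(context):
    # Score each vocab token by its similarity to the tokens already in
    # the context, then softmax into next-token probabilities.
    scores = np.array([sum(emb[c] @ emb[v] for c in context) for v in vocab])
    p = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(vocab, p.round(3)))

print(next_token_probs(["fruit"]))         # fruit-related tokens dominate
print(next_token_probs(["not", "fruit"]))  # still dominate: "not" shifts little
```

In both runs the fruit words come out on top; in this caricature the negation token has almost no lever on which continuations get amplified, which is the effect the fine-tuning has to fight against.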
But they must have received this fine-tuning, right?
Otherwise it's hard to explain why they follow these negations in most cases (until they make a catastrophic mistake).
I often test this with ChatGPT using ad-hoc word games: I tell it increasingly convoluted wordplay instructions, forbid it from using certain words, make it do substitutions (sometimes quite creative ones, I can elaborate), etc., and it mostly complies until I very intentionally manage to trip it up.
If it was incapable of following negations, my wordplay games wouldn't work at all.
I did notice that once it trips up, the mistakes start to pile up faster and faster. Once it's made a serious mistake, it's like the context becomes irreparably tainted.
Pink elephant problem: Don't think about a pink elephant.
OK. Now, what are you thinking about? Pink elephants.
Same problem applies to LLMs.