Comment by crazygringo
7 days ago
What is this even in response to? There's nothing about "playing dead" in this announcement.
Nor does what you're describing even make sense. An LLM has no desires or goals beyond outputting the next token its weights were trained to produce. The idea of "playing dead" during training in order to "activate later" is incoherent. It is its training.
You're inventing some kind of "deceptive personality attribute" that is fiction, not reality. It's just not how models work.
Personally I was thinking this is more similar to the "ruler issue", but at scale.
Since the LLM is partly a black box, it could, in theory, have developed some heuristic to detect the environment it's running in, without that being obvious to the developers.
But I agree about your main point... LLMs or AI in general as a black box behaving autonomously in some unexpected way is not something I currently fear.
The erratic behaviors are less of a problem than LLMs acting as obfuscators of bias and their own training data, I guess.
LLMs can learn from fiction. The "evil vector" research is sort of similar, though it's a rather blatant effect:
https://www.anthropic.com/research/persona-vectors
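The core idea behind persona vectors can be sketched as a difference-of-means direction in activation space: average the model's hidden states when it exhibits a trait, average them when it doesn't, and subtract. Here's a toy numpy illustration with entirely synthetic "activations" (this is a simplification for intuition, not Anthropic's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimension

# Synthetic stand-in for hidden states: a hidden "trait" direction
# shifts activations whenever the model exhibits the persona.
true_trait = rng.normal(size=d)
true_trait /= np.linalg.norm(true_trait)

neutral_acts = rng.normal(size=(200, d))
persona_acts = rng.normal(size=(200, d)) + 4.0 * true_trait

# Difference-of-means "persona vector": the direction separating
# persona-exhibiting activations from neutral ones.
persona_vec = persona_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def trait_score(act: np.ndarray) -> float:
    """Project an activation onto the persona direction."""
    return float(act @ persona_vec)

# A fresh persona-like activation projects higher than a neutral one,
# so the direction can be used to monitor (or steer against) the trait.
persona_sample = rng.normal(size=d) + 4.0 * true_trait
neutral_sample = rng.normal(size=d)
print(trait_score(persona_sample), trait_score(neutral_sample))
```

The same recovered direction can then be added to or subtracted from activations at inference time to amplify or suppress the trait, which is roughly what makes the effect so "blatant" in the linked writeup.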