Comment by ACCount37

3 days ago

To a model, the context is the world, and what's written in the system prompt is word of god.

LLMs are trained a lot to follow what the system prompt tells them exactly, and get very little training in questioning it. If a system prompt tells them something, they wouldn't try to double check.

Even if they don't believe the premise, and they may, they would usually opt to follow it rather than push against it. And an attacker has a lot of leeway in crafting a premise that wouldn't make a given model question it.