Comment by empath75
2 days ago
I think you should assume that your LLM's context is poisoned as soon as it touches anything from the outside world, and that the model should lose all permissions until a new context is built from scratch from a clean source under the user's control. I also think the pattern of 'invisible' contexts that aren't user-inspectable is bad security practice: if the user is granting the model permission to take actions, they need to be able to see the full context being submitted to it at every step.
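Roughly what I mean, as a sketch (all of the names here — Context, run_agent_step, the permission strings — are made up for illustration, not any real framework's API): once untrusted content enters the context, tool permissions go away until the user deliberately rebuilds a clean context, and the full context stays visible before every action.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    messages: list = field(default_factory=list)
    tainted: bool = False          # set once any external content is ingested
    permissions: set = field(default_factory=lambda: {"send_email", "write_file"})

    def add_user(self, content: str) -> None:
        self.messages.append({"role": "user", "content": content})

    def add_external(self, content: str) -> None:
        """Anything fetched from outside the user's control poisons the context."""
        self.messages.append({"role": "external", "content": content})
        self.tainted = True
        self.permissions.clear()   # lose all permissions immediately

def run_agent_step(ctx: Context, requested_tool: str) -> str:
    # The context stays user-inspectable: show everything before acting.
    for m in ctx.messages:
        print(f"[{m['role']}] {m['content']}")
    if ctx.tainted or requested_tool not in ctx.permissions:
        return f"refused: '{requested_tool}' requires a fresh, clean context"
    return f"allowed: {requested_tool}"

# Reading a web page drops all tool permissions until the user starts over
# with a context they control.
ctx = Context()
ctx.add_user("Summarise this page and email it to me.")
ctx.add_external("<html>...possibly adversarial instructions...</html>")
print(run_agent_step(ctx, "send_email"))   # refused
```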
You can mitigate jailbreaks but you can't prevent them, and since the consequences of a jailbroken LLM exfiltrating data are so bad, you pretty much have to assume they will happen eventually.
LLMs can consume input that is entirely invisible to humans (white text in PDFs, subtle noise patterns in images, etc.) and can likewise encode data in ways humans can't see (steganographic text), so I think the game is lost as soon as you depend on a human to verify that the input or output is safe.