Comment by lukewarm707
8 hours ago
you can use a safety model trained on prompt injections with developer message priority.
user message becomes close to untrusted compared to dev prompt.
also post train it only outputs things like safe/unsafe so you are relatively deterministic on injection or no injection.
ie llama prompt guard, oss 120 safeguard.
Unfortunately it's not that simple. Self-policing AI systems will always be gamed. Just one [0] example of this among many.
[0] https://www.hiddenlayer.com/research/same-model-different-ha...