Comment by snug

17 hours ago

I think this can be great as an additional layer of security: have a non-LLM layer do some analysis with static rules first, and only run a request through the LLM judge if something seems phishy, so you don't have to run every request through it, which would be very expensive.

Edit: actually, it looks like it has two policy engines embedded.
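The tiered setup described above can be sketched in a few lines. Everything here is hypothetical: the regex patterns are illustrative, and `llm_judge` stands in for whatever expensive model call you would actually make.

```python
import re

# Heuristic patterns that often show up in prompt-injection attempts.
# A real deployment would use a much richer ruleset; these are illustrative.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def static_screen(text: str) -> bool:
    """Cheap non-LLM pass: return True if the text looks suspicious."""
    return any(p.search(text) for p in SUSPICIOUS_PATTERNS)

def screen_request(text: str, llm_judge) -> str:
    """Run the expensive LLM judge only when the static rules flag the input."""
    if not static_screen(text):
        return "accept"      # fast path: no LLM call, no extra latency or tokens
    return llm_judge(text)   # escalate only the suspicious minority
```

The point of the split is cost: the static pass handles the bulk of traffic for free, and the judge's latency and token burn only apply to the small fraction of requests that trip a rule.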

And we don't think the judge can/will be gamed? Also... It's an LLM, it's going to add delay and additional token burn. One subjective black box protecting another subjective black box. I mean, what couldn't go wrong?

  • You can use a safety model trained on prompt injections with developer-message priority.

    The user message is then treated as close to untrusted compared to the dev prompt.

    Also, if you post-train it to only output labels like safe/unsafe, you get relatively deterministic injection / no-injection verdicts.

    E.g. Llama Prompt Guard, oss 120 safeguard.
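The classifier approach above can be sketched as follows. This is a minimal sketch under stated assumptions: `safety_model` is a hypothetical callable wrapping a model post-trained to emit only the literal tokens "safe" or "unsafe", and the role names mirror the developer-over-user priority the comment describes.

```python
def is_injection(dev_prompt: str, user_msg: str, safety_model) -> bool:
    """Binary injection check: the model's output space is constrained to
    safe/unsafe, so the verdict is relatively deterministic rather than
    free-form text an attacker could steer."""
    verdict = safety_model([
        {"role": "developer", "content": dev_prompt},  # higher-priority, trusted
        {"role": "user", "content": user_msg},         # treated as untrusted
    ]).strip().lower()
    if verdict not in ("safe", "unsafe"):
        # Fail closed: any out-of-vocabulary answer is treated as unsafe.
        return True
    return verdict == "unsafe"
```

Failing closed on unexpected output matters here: if an injection somehow coaxes the classifier into free-form text, the caller still gets a "deny", not attacker-controlled content.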

What happens when a prompt injection attack exploits the judge LLM and results in a higher level of attacker control than if it never existed?

  • How can it result in a higher level of control? I don't see why the "judge" should have access to anything except one tool that allows it to send an "accept" or "deny" command.
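That containment argument can be made concrete. In this sketch (names are hypothetical), the judge is an opaque callable whose only effect on the world is the string it returns; the caller, not the judge, holds the `execute` capability, and anything other than an exact "accept" fails closed.

```python
def gated_action(request: str, judge, execute) -> str:
    """The judge's only 'tool' is its accept/deny verdict. It never holds a
    reference to execute(), so a compromised judge can at worst return a
    wrong verdict -- it cannot gain capabilities the caller didn't have."""
    verdict = judge(request)
    if verdict == "accept":
        return execute(request)
    return "denied"   # fail closed on "deny" and on anything unexpected
```

Under this design, exploiting the judge degrades it to the level of having no judge at all (bad requests slip through), but it cannot raise attacker control above that baseline.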