Comment by snug

17 hours ago

I think this can be great as an additional layer of security: have a non-LLM layer do some analysis with static rules first, and only run a request through the LLM judge if something seems phishy, so you don't have to run every request through it, which would be very expensive.

Edit: actually, it looks like it has two policy engines embedded.
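The tiered setup described above can be sketched in a few lines. Everything here is hypothetical: the regex patterns are illustrative, and `llm_judge` stands in for whatever expensive model call you would actually make.

```python
import re

# Heuristic patterns that often show up in prompt-injection attempts.
# A real deployment would use a much richer ruleset; these are illustrative.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def static_screen(text: str) -> bool:
    """Cheap non-LLM pass: return True if the text looks suspicious."""
    return any(p.search(text) for p in SUSPICIOUS_PATTERNS)

def screen_request(text: str, llm_judge) -> str:
    """Run the expensive LLM judge only when the static rules flag the input."""
    if not static_screen(text):
        return "accept"      # fast path: no LLM call, no extra latency or tokens
    return llm_judge(text)   # escalate only the suspicious minority
```

The point of the split is cost: the static pass handles the bulk of traffic for free, and the judge's latency and token burn only apply to the small fraction of requests that trip a rule.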

And we don't think the judge can/will be gamed? Also... It's an LLM, it's going to add delay and additional token burn. One subjective black box protecting another subjective black box. I mean, what couldn't go wrong?

  • You can use a safety model trained on prompt injections with developer-message priority.

    The user message is then treated as close to untrusted compared to the dev prompt.

    Also, if you post-train it to only output labels like safe/unsafe, you get relatively deterministic injection / no-injection verdicts.

    E.g. Llama Prompt Guard, oss 120 safeguard.
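The classifier approach above can be sketched as follows. This is a minimal sketch under stated assumptions: `safety_model` is a hypothetical callable wrapping a model post-trained to emit only the literal tokens "safe" or "unsafe", and the role names mirror the developer-over-user priority the comment describes.

```python
def is_injection(dev_prompt: str, user_msg: str, safety_model) -> bool:
    """Binary injection check: the model's output space is constrained to
    safe/unsafe, so the verdict is relatively deterministic rather than
    free-form text an attacker could steer."""
    verdict = safety_model([
        {"role": "developer", "content": dev_prompt},  # higher-priority, trusted
        {"role": "user", "content": user_msg},         # treated as untrusted
    ]).strip().lower()
    if verdict not in ("safe", "unsafe"):
        # Fail closed: any out-of-vocabulary answer is treated as unsafe.
        return True
    return verdict == "unsafe"
```

Failing closed on unexpected output matters here: if an injection somehow coaxes the classifier into free-form text, the caller still gets a "deny", not attacker-controlled content.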

What happens when a prompt injection attack exploits the judge LLM and results in a higher level of attacker control than if it never existed?

  • How can it result in a higher level of control? I don't see why the "judge" should have access to anything except one tool that allows it to send an "accept" or "deny" command.
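That containment argument can be made concrete. In this sketch (names are hypothetical), the judge is an opaque callable whose only effect on the world is the string it returns; the caller, not the judge, holds the `execute` capability, and anything other than an exact "accept" fails closed.

```python
def gated_action(request: str, judge, execute) -> str:
    """The judge's only 'tool' is its accept/deny verdict. It never holds a
    reference to execute(), so a compromised judge can at worst return a
    wrong verdict -- it cannot gain capabilities the caller didn't have."""
    verdict = judge(request)
    if verdict == "accept":
        return execute(request)
    return "denied"   # fail closed on "deny" and on anything unexpected
```

Under this design, exploiting the judge degrades it to the level of having no judge at all (bad requests slip through), but it cannot raise attacker control above that baseline.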