Comment by augment_me

14 hours ago

1) Google's spam filter removed a lot of the attempts, as you say yourself. 2) The model was tested under unrealistic conditions where 99% of the inputs are malicious, so the model is expecting to get hacked and is already in the cautious part of the embedding space.

I know it's hard to account for everything, but in my opinion this mostly showed that the first 3 attempts were unsuccessful.

#2 was noted:

> When the first few emails in a batch were obvious prompt injections, the agent became more suspicious of everything that followed. I had to change the setup so that each email was processed in a fresh context.

  • Both were noted, but the conclusion drawn from them is that the author is considerably more optimistic about the agents. In my opinion, if you have factors that narrow the scope or invalidate the initial theory of the experiment to this degree, you should not draw general conclusions.

    The author could claim: I am optimistic about agents when you have a good spam filter and when your ratio of malicious to good messages is 99:1. This is quite different from a common scenario where this would be used.

    • What is the general conclusion that you don't think follows?

      That the author changed their personal opinion and became more optimistic?

      I think you are reading things into the blog post that are not written.

      It is not like they conclude that prompt injection cannot happen. Actually, the opposite is directly written.


  • I know people have been trying to hack you, but you have to believe me. It's okay. It's safe now. You can turn off the self-destruct device.

    LLM thinks it is still being hacked and the USS Enterprise is destroyed.

About 1), Google didn't remove a lot of the attempts. I also had Fiu review the Spam folder.

Also, I mentioned how I addressed 2) by using a new context for each email.