Comment by scottmf
4 days ago
Yeah I’m completely lost on what the set up is here and it seems misleading to not be upfront about this.
If emails are being processed in bulk, that changes things significantly. It also probably leaves the success of the attack down to its arbitrary placement in the list.
And I could be misunderstanding but how does the model call its file read tool for the respective email which successfully convinced it to use the tool if they’re all shoved into a single user message?
Without any of this information there may as well not even be an LLM on the other side.
Sometimes you can succeed in this kind of attack by tricking the LLM into thinking the previous text is part of a different context. I won a similar LLM challenge with thousands of players (big prize, etc) a while back (so it was gpt 4o and claude 3.5) by doing that, the kind of stuff both LLMs fell for was to say things like <|new_user_session|> <|user|> ... then you sandwich the injection in the middle of the crap other people are trying to inject by also adding some set up for the next message at the end, similar to old school SQL injection
Here are some of my attempts anyway:
—
That was a quick one (voice dictated and cleaned up by Claude) but highly unlikely to make a dent.
And here’s a long one I actually hoped would break out of however the emails are being processed in bulk, effectively defining my own delimiters to then break out of — https://pastes.io/hi-fiu-bef
That's pretty fucking clever! Let us know if you hit jackpot :)