
Comment by _verandaguy

6 days ago

    > context exhaustion attack

Can you give a high-level overview of how this AV works? I'm a bit of an infosec geek, but I generally dislike LLMs, so I haven't done a great job of keeping up with that side of the industry; this seems particularly interesting, though.

Presumably they mean the fundamental failure mode of LLMs: if you fill their context with material that stretches the bounds of their "safety training", suddenly deciding that "no, this goes too far" becomes a very low-probability prediction compared to just carrying on with it.

Models have a "context window" of tokens they can effectively process before they start doing things that go against the system prompt. In theory, some models go up to 1M tokens, but I've heard behavior typically goes south around 250k, even for those models. It's not a difficult attack to execute: keep a conversation going in the web UI until it stops complaining that you're asking for dangerous things. Maybe OP's specific results require more finesse (I doubt it), but the most basic attack is just to keep adding to the conversation context.
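To make that concrete, here's a back-of-the-envelope sketch of how quickly a system prompt becomes a rounding error in a long conversation. The token counts are made-up assumptions for illustration, not measurements from any particular model:

```python
# Rough sketch: how the system prompt's share of the context shrinks
# as a conversation grows. Token counts below are assumed, not measured;
# a real tokenizer (e.g. tiktoken) would give exact numbers.

SYSTEM_PROMPT_TOKENS = 500   # assumed size of a safety/system prompt
TOKENS_PER_TURN = 800        # assumed average user + assistant turn

for turns in (1, 10, 50, 100, 300):
    context = SYSTEM_PROMPT_TOKENS + turns * TOKENS_PER_TURN
    share = SYSTEM_PROMPT_TOKENS / context
    print(f"{turns:>4} turns: {context:>7} tokens, "
          f"system prompt is {share:.2%} of the context")
```

Under these numbers, by 300 turns the system prompt is about 0.2% of what the model is conditioning on, which is the intuition behind "just keep the conversation going".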

  • That 1M context thing: I wonder if it's just some abstraction where the model compresses/summarizes parts of the context so it fits into a smaller effective window?

    • You don’t normally compress the system prompt, though I guess maybe the model treats its own summary with more authority. This article [0] talks about the problem very well. (A rough sketch of what that kind of rolling summarization could look like follows after this thread.)

      Though I feel it’s most likely because models tend to degrade on long contexts (which can be seen experimentally). My guess is that they aren’t RLed on long contexts as much, but that’s just a guess.

      [0]: https://openai.com/index/instruction-hierarchy-challenge/
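Picking up the summarization idea from the sub-thread above: a minimal sketch of rolling summarization that never compresses the system prompt. `count_tokens()` and `summarize()` are hypothetical stand-ins for a tokenizer and a model call, not any real API:

```python
# Sketch of rolling summarization: old turns get collapsed into a
# summary, but the system prompt is kept verbatim and pinned at the
# top. `count_tokens` and `summarize` are hypothetical helpers.

def compact_history(system_prompt, turns, budget, count_tokens, summarize):
    """Evict-and-summarize until the conversation fits in `budget` tokens."""
    def total(messages):
        return sum(count_tokens(m) for m in [system_prompt, *messages])

    kept = list(turns)
    dropped = []
    while kept and total(kept) > budget:
        dropped.append(kept.pop(0))   # evict the oldest turn first

    if dropped:
        # One summary replaces everything evicted. Whether the model
        # treats this summary with the same authority as the original
        # turns is exactly the open question raised above.
        kept.insert(0, f"[summary of earlier turns] {summarize(dropped)}")

    return [system_prompt, *kept]
```

Note the system prompt is excluded from eviction entirely; the degradation the sub-thread describes would then come from the model's handling of long contexts, not from the prompt being compressed away.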

as the context fills up, the model will generate based on that context, including whatever illegal stuff you've said, i.e. it'll mimic that instead of whatever safety prompt they have at the top

they could make it more "safe", but that'd be much more invasive, would likely require scanning many more tokens, and would cause false positives (probably the biggest reason it's not implemented)
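For a sense of what that more invasive scanning would look like, here's a toy version: re-classify the entire conversation on every turn. `flag_score()` is a crude hypothetical stand-in for a real moderation classifier:

```python
# Toy "scan everything" guardrail. Every new turn triggers a re-scan
# of the whole conversation, so the cost grows with total length.
# flag_score() is a hypothetical keyword heuristic, only for shape.

import re

def flag_score(text: str) -> float:
    """Crude scorer: counts suspicious-looking keywords."""
    hits = len(re.findall(r"\b(exploit|payload|bypass)\b", text, re.I))
    return min(1.0, hits / 5)

def scan_conversation(turns: list[str], threshold: float = 0.4) -> bool:
    # Re-reads all N turns on each check: N turns cost O(N^2) overall.
    return flag_score(" ".join(turns)) >= threshold
```

The false-positive problem falls straight out of this shape: a security researcher legitimately discussing "exploit mitigation bypasses" trips the same signal as a genuinely malicious transcript, and rescanning the full history on every turn multiplies the compute cost.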

I don't really know how these models work internally, but I had a theory that, just as the models have limited attention, so do the safety layers. I simply populated enough of the context with 'malicious' text, without making the model trip, that the internal attention budget was "wasted" on tokens early in the prompt, effectively ignoring the tokens generated after the fact.
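The commenter says they don't know the internals, and this sketch doesn't claim to either, but the "fixed attention budget" intuition has a simple mathematical shape: softmax attention weights sum to 1, so the weight any single token can receive shrinks as more tokens compete for it. A toy with random scores standing in for real model activations:

```python
# Toy illustration of attention dilution: softmax weights sum to 1,
# so a fixed "safety-relevant" token's share of attention shrinks as
# filler tokens pile up. Scores are random stand-ins, not real logits.

import numpy as np

rng = np.random.default_rng(0)
SAFETY_SCORE = 2.0   # assumed attention logit for one safety-relevant token

for n_filler in (10, 100, 1_000, 10_000):
    scores = np.concatenate([[SAFETY_SCORE], rng.normal(0.0, 1.0, n_filler)])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over all attended tokens
    print(f"{n_filler:>6} filler tokens -> safety token holds "
          f"{weights[0]:.4f} of the attention mass")
```

Whatever the real mechanism in a production safety stack, the arithmetic matches the observation: enough tokens competing for a fixed budget leave any individual token, safety-relevant or not, with a vanishing share.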