Comment by lcnPylGDnU4H9OF

7 days ago

Models have a "context window" of tokens they will effectively process before they start doing things that go against the system prompt. In theory, some models go up to 1M tokens but I've heard it typically goes south around 250k, even for those models. It's not a difficult attack to execute: keep a conversation going in the web UI until it doesn't complain that you're asking for dangerous things. Maybe OP's specific results require more finesse (I doubt it), but the most basic attack is to just keep adding to the conversation context.

2 comments

lcnPylGDnU4H9OF

r_lee 7 days ago

that 1M context thing, I wonder if it's just some abstraction thing where it compresses/sums up parts of the context so it fits into a smaller context window?

strongpigeon 7 days ago

You don’t normally compress the system prompts, though I guess maybe it treats its own summary with more authority. This article [0] talks about the problem very well.
Though I feel it’s most likely because models tend to degrade on large context (which can be seen experimentally). My guess is that they aren’t RLed on large context as much, but that’s just a guess.
[0]: https://openai.com/index/instruction-hierarchy-challenge/