Comment by _flux

2 days ago

Yeah, it's quite amazing how none of the models seem to have any "sudo" tokens that could be used to express things normal tokens cannot.

"sudo" tokens exist - there are tokens for beginning/end of a turn, for example, which the model can use to determine where the user input begins and ends.

But even with those tokens, these models are fundamentally not "intelligent" enough to reliably distinguish when they are operating on user input vs. system input.

In a traditional program, you can configure the program such that user input can only affect a subset of program state - for example, when processing a quoted string, the parser will only ever append to the current string, rather than creating new expressions. With LLMs, however, user input and system input are all mixed together, so "user" and "system" input can both affect every part of the system's overall state. This means user input can eventually push the overall state in a direction that violates a security boundary, simply because it is able to affect that state.
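
As a toy example of that kind of confinement (a minimal sketch of a hypothetical tokenizer, not any particular parser): while it is inside a quoted string, the code below can only append characters to the current string buffer; no sequence of user-supplied characters can reach any other part of the parser's state.

    # Minimal sketch of state confinement in a traditional parser: inside a
    # quoted string, input can only be appended to the current buffer, never
    # interpreted as new expressions or commands.
    def tokenize(source: str) -> list:
        tokens = []
        i = 0
        while i < len(source):
            ch = source[i]
            if ch == '"':
                # "String" state: everything up to the closing quote is data.
                buf = []
                i += 1
                while i < len(source) and source[i] != '"':
                    buf.append(source[i])
                    i += 1
                tokens.append(("STRING", "".join(buf)))
                i += 1  # skip the closing quote
            elif ch.isspace():
                i += 1
            else:
                # Outside strings, a run of non-space characters is a symbol.
                j = i
                while j < len(source) and not source[j].isspace() and source[j] != '"':
                    j += 1
                tokens.append(("SYMBOL", source[i:j]))
                i = j
        return tokens

    # Even if the quoted text looks like a command, it stays inert data:
    print(tokenize('run "delete everything" now'))
    # [('SYMBOL', 'run'), ('STRING', 'delete everything'), ('SYMBOL', 'now')]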

What's needed isn't "sudo tokens"; it's a fundamental rethinking of the architecture in a way that guarantees that certain aspects of reasoning or behaviour cannot be altered by user input at all. That's such a large change that the result would no longer be an LLM, but something new entirely.

  • I was actually thinking of sudo tokens as a completely separate set of authoritative tokens. So basically doubling the token space. I think that would make it easier for the model to be trained to respect them. (I haven't done any work in this domain, so I could be completely wrong here.) A rough sketch of what I mean is at the end of this thread.

    • If I understand the problem right, the issue is that even if you have a completely separate set of authoritative tokens, the model's internal state isn't partitioned into authoritative and non-authoritative parts. There's no 'user space' and 'kernel space', so to speak, even if kernel space happens to be using a different instruction set. As a result, non-authoritative tokens can perturb parts of the model state that you would ideally want to be immutable once the system prompt has been parsed. Worst case, the state created by parsing the system prompt could be completely overwritten given enough non-authoritative tokens.

      I've tried to think of a way to solve this at training time but it seems really hard. I'm sure research into the topic is ongoing though.

  • It's like ASCII control characters and display characters lmao
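
For reference, here is a rough sketch of the "doubled token space" idea from the thread above, and of the objection to it: even if system-channel tokens get their own authoritative half of the vocabulary (their own embedding rows), every token still writes into the same shared hidden state, so later user tokens can keep steering it. All names, sizes, and the mixing step here are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    VOCAB = 1000   # made-up base vocabulary size
    DIM = 16       # made-up hidden dimension

    # "Doubled" embedding table: rows [0, VOCAB) are ordinary tokens,
    # rows [VOCAB, 2*VOCAB) are their authoritative / "sudo" counterparts.
    embedding = rng.normal(size=(2 * VOCAB, DIM))

    def embed(token_ids, authoritative=False):
        ids = np.asarray(token_ids)
        if authoritative:
            ids = ids + VOCAB  # same words, privileged half of the table
        return embedding[ids]

    # The system prompt uses the authoritative half, user input the normal half.
    system_vecs = embed([12, 7, 99], authoritative=True)
    user_vecs = embed([512, 3, 3, 3, 3, 3, 3, 3])

    # Toy stand-in for the model's internal state: one vector that every
    # token gets mixed into, regardless of which half of the vocabulary it
    # came from. There is no separate "kernel space" partition of the state.
    all_vecs = np.concatenate([system_vecs, user_vecs])
    state = all_vecs.mean(axis=0)

    # The more user tokens arrive, the smaller the system prompt's share of
    # that shared state; nothing marks its contribution as immutable.
    print(f"system tokens contribute {len(system_vecs) / len(all_vecs):.0%} of the state")

The mean is obviously a crude stand-in for attention, but the structural point is the same: the privileged embeddings change what the tokens look like on the way in, not which parts of the state they are allowed to touch.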