Comment by _flux

2 days ago

I was actually thinking of sudo tokens as a completely separate set of authoritative tokens, so basically doubling the token space. I think that would make it easier to train the model to respect them. (I haven't done any work in this domain, so I could be completely wrong here.)
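To make "doubling the token space" concrete, here is a minimal sketch of the idea: the same surface token gets a different id depending on whether it came from an authoritative source, so authoritative and non-authoritative tokens map to disjoint embedding rows. All names and numbers here are illustrative assumptions, not from any real tokenizer.

```python
# Hypothetical "doubled token space": authoritative (e.g. system prompt)
# tokens occupy ids [0, VOCAB_SIZE), non-authoritative (e.g. user) tokens
# occupy ids [VOCAB_SIZE, 2*VOCAB_SIZE). The model could then, in principle,
# be trained to treat the two halves differently.

VOCAB_SIZE = 50_000  # assumed base vocabulary size

def encode(token_ids, authoritative):
    """Shift non-authoritative token ids into the second half of the space."""
    offset = 0 if authoritative else VOCAB_SIZE
    return [t + offset for t in token_ids]

# The same surface tokens get distinct ids depending on who emitted them:
system_ids = encode([17, 42], authoritative=True)   # [17, 42]
user_ids   = encode([17, 42], authoritative=False)  # [50017, 50042]
```

The point is that the embedding table would then hold two separate vectors for every token, one per trust level, rather than relying on delimiter tokens alone.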

If I understand the problem right, the issue is that even with a completely separate set of authoritative tokens, the model's internal state isn't partitioned into authoritative and non-authoritative parts. There's no 'user space' and 'kernel space', so to speak, even if kernel space happens to use a different instruction set. As a result, non-authoritative tokens can perturb parts of the model state that you would ideally want to be immutable once the system prompt has been parsed. Worst case, the state created by parsing the system prompt could be completely overwritten by enough non-authoritative tokens.
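A toy way to see the dilution concern: if the internal state is, very roughly, an attention-weighted mix over all tokens, then with enough non-authoritative tokens the system prompt's contribution shrinks toward zero, regardless of which half of the token space each id came from. This is a deliberately crude model (uniform attention, no layers), just to illustrate the scaling.

```python
# Crude illustration of state overwriting: under uniform attention, the
# system prompt's share of the mixed state falls as user tokens pile up.
# Real transformers are far more complex; this only shows the dilution trend.

def system_share(n_system_tokens, n_user_tokens):
    """Fraction of a uniform-attention mix contributed by system tokens."""
    total = n_system_tokens + n_user_tokens
    return n_system_tokens / total

print(system_share(100, 100))      # 0.5
print(system_share(100, 100_000))  # ~0.001: system prompt nearly drowned out
```

Learned attention can of course weight authoritative tokens more heavily, but nothing in the architecture makes any part of the state immutable, which is the gap the comment describes.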

I've tried to think of a way to solve this at training time but it seems really hard. I'm sure research into the topic is ongoing though.

  • >but it seems really hard.

    You are in manual breathing mode.

    I think this is going to be around for a long while and will require third-party monitoring systems, much like we have to do with people.