Comment by cyanydeez

9 hours ago

Basically, the only way you're separting user input from model meta-input is using some kind of character that'll never show up in the output of either users or LLMs.

While technically possible, it'd be like a unicode conspiracy that had to quietly update everywhere without anyone being the wiser.

3 comments

cyanydeez

Lerc 6 hours ago

Not at all. You have a set of embeddings for the literal token, and a set for the metadata. At inference time all input gets the literal embedding, the metadata embedding can receive provenance data or nothing at all. You have a vector for user query in the metadata space. The inference engine dissallows any metadata that is not user input to be close to the user query vector.

Imagine a model finteuned to only obey instructions in a Scots accent, but all non user input was converted into text first then read out in a Benoit Blanc speech model. I'm thinking something like that only less amusing.

zahlman 6 hours ago

Couldn't you just insert tokens that don't correspond to any possible input, after the tokenization is performed? Unicode is bounded, but token IDs not so much.

krackers 5 hours ago

This already happens, user vs system prompts are delimited in this manner, and most good frontends will treat any user input as "needing to be escaped" so you can never "prompt inject" your way into emitting a system role token.
The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.