← Back to context

Comment by maxbond

18 hours ago

The escape algorithm here is very simple, you remove special tokens from the runtime tokenizer's vocabulary so that it's forced to encode them as multiple non-special tokens. (That doesn't actually mean the LLM won't treat them as special tokens though, so this isn't sufficient on it's own.)

Cool technique, but I'm not sure I'd call it simple.

Doing this means that you can't just tokenize the string output of the chat template as one big string. You might need to tokenize things separately, and combine them after.

  • If you want the token sequence, you ought to avoid discarding it when you produce the string output. This is because, even ignoring special tokens, different token sequences map to the same strings.

    From a space perspective, this is actually better because tokenization tends to compress text quite well. For example, common tokens in English text take up ~4 characters on average (expands to 32 bits), but only take up a fraction of that to store (15-18 bits/token depending on vocabulary size)

    In fact it appears that designing the tokens as a text compression encoding is a decent approach, since it's roughly what some LLMs do. For example, early GPT tokenizers followed byte pair encoding to create the vocabulary, which is a text compression algorithm from the 90s.