Comment by sebastianmestre

2 hours ago

If you want the token sequence, keep it rather than discarding it when you produce the string output: even ignoring special tokens, multiple distinct token sequences can map to the same string, so the tokenization cannot be recovered from the text alone.
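A minimal sketch of the ambiguity, using a hypothetical toy vocabulary (not any real tokenizer's):

```python
# Toy vocabulary: two different token-ID sequences decode to the same string.
vocab = {0: "un", 1: "happy", 2: "unhappy"}

def decode(ids):
    return "".join(vocab[i] for i in ids)

a = decode([0, 1])  # "un" + "happy"
b = decode([2])     # "unhappy"
assert a == b       # same string, but [0, 1] != [2]
```

Once `decode` has run, there is no way to tell which of the two sequences produced the string.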

From a space perspective this is actually better, because tokenization compresses text quite well. For example, common tokens in English text average ~4 characters (32 bits as raw bytes) but take only a fraction of that to store: 15-18 bits per token, depending on vocabulary size.
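A back-of-the-envelope check of those numbers. The 4-characters-per-token average is an assumption; the vocabulary sizes are ballpark figures for small-to-large tokenizers:

```python
import math

chars_per_token = 4                 # assumed average for English text
bits_as_bytes = chars_per_token * 8  # 32 bits of raw single-byte characters

# A fixed-length code needs log2(vocab_size) bits per token.
for vocab_size in (50_000, 100_000, 200_000):
    bits_per_token = math.log2(vocab_size)
    print(f"vocab {vocab_size}: {bits_per_token:.1f} bits/token vs {bits_as_bytes} bits of text")
```

For vocabularies between ~50k and ~200k entries this lands in the 15.6-17.6 bit range, roughly a 2x saving over storing the characters.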

In fact, designing the token set as a text compression encoding seems like a decent approach, since that is roughly what some LLMs do. For example, early GPT tokenizers built their vocabulary with byte pair encoding (BPE), a text compression algorithm from the 90s.
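The core of BPE vocabulary construction fits in a few lines. This is a simplified sketch (character-level, greedy, no byte fallback or regex pre-splitting like real GPT tokenizers use):

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Repeatedly merge the most frequent adjacent token pair into a new token."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats; further merges save nothing
            break
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # replace the pair with a single token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low low low lower lowest", 10)
```

On this tiny corpus the first merges build up `"lo"` and then `"low"`, exactly the compression behavior that makes BPE vocabularies space-efficient.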