Comment by wongarsu
13 hours ago
We usually build the tokenizer by optimizing for one goal (space-efficient encoding of text), then use it in a model that is trained for an entirely different goal (producing good text, "reasoning", "coding", etc.). It is not immediately clear that the tokenizer's optimization goal is actually the one that best serves the training of the LLM.
That's what all these attempts boil down to. They don't presume to find a more space-efficient encoding by hand; they assume that the optimization goal for the tokenizer was wrong and that they can do better by adding some extra rules. And this isn't entirely without precedent: most tokenizers have a couple of "forced" tokens that were not organically discovered. Changing how digits are grouped in the tokenizer is another area where wins have been shown.
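For readers unfamiliar with what "optimizing for space-efficient encoding" means in practice: most tokenizers are trained with byte-pair encoding (BPE), which greedily merges the most frequent adjacent pair of symbols until the vocabulary is full. A minimal sketch (the function name `bpe_merges` and the toy input are mine, not from any real tokenizer):

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    pair of symbols. The objective is pure compression -- the same
    text represented in fewer tokens -- with no reference to any
    downstream language-modeling goal."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats, nothing left to compress
            break
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # replace the pair with one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_merges("aaabdaaabac", 2)
print(tokens, merges)  # sequence shrinks; concatenation recovers the text
```

The "forced" tokens mentioned above are exactly the points where people override this greedy frequency criterion by hand.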
This is where projects like nanochat are really valuable for quickly and (relatively) cheaply trying out various tweaks.
>It is not immediately clear that the optimization goal for the tokenizer is actually the one that best serves the training of the llm.
Except that is exactly what research has shown. Besides, the tokenizer's training goal is literally just to encode text efficiently in fewer tokens by growing the vocabulary, which directly benefits the attention mechanism if you look at the dimensions of the matrices involved. The biggest issues so far have stemmed from mismatches between the tokenizer's and the LLM's training sets [1], and from the fact that when writing, people primarily work with character-based rather than word-part-based text (even though that gets muddy when you look at what is actually happening in the brain).
[1] https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
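The dimension argument above can be made concrete with back-of-the-envelope cost formulas. Attention cost scales quadratically with sequence length, while embedding parameters scale only linearly with vocabulary size, so a larger vocabulary that shortens sequences is usually a good trade. The numbers below are hypothetical, chosen only to illustrate the scaling:

```python
def attention_flops(seq_len, d_model):
    # Rough cost of one attention layer: QK^T score matrix plus the
    # weighted sum over values, both O(seq_len^2 * d_model).
    return 2 * seq_len ** 2 * d_model

def embedding_params(vocab_size, d_model):
    # Input embedding plus an untied output projection,
    # each vocab_size x d_model.
    return 2 * vocab_size * d_model

d = 768  # hypothetical model width

# Hypothetical scenario: doubling the vocabulary lets the tokenizer
# encode the same text in 25% fewer tokens.
flops_small_vocab = attention_flops(2048, d)
flops_large_vocab = attention_flops(1536, d)
print(flops_large_vocab / flops_small_vocab)   # (1536/2048)^2 = 0.5625

# Meanwhile the embedding tables only grow linearly with vocab size.
print(embedding_params(64_000, d) / embedding_params(32_000, d))  # 2.0
```

So a 25% reduction in token count cuts attention FLOPs by roughly 44% per layer, at the price of a linearly larger embedding table and output softmax.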