
Comment by ipieter

11 hours ago

There is currently very little evidence that morphological tokenizers help model performance [1]. For languages like German (where words get glued together) there is a bit more evidence (e.g. a paper I worked on [2]), but overall I'm starting to suspect the bitter lesson also holds for tokenization.

[1] https://arxiv.org/pdf/2507.06378

[2] https://pieter.ai/bpe-knockout/

I never understood why people want this in the first place. Sure, making this step more human-explainable would be nice and might even fix some very particular problems for particular languages, but it directly goes against the primary objective of a tokenizer: optimizing sequence length vs. vocabulary size. That is a pretty clear and hard optimization target, and the best you can do is make sure that your tokenizer training set more closely mimics your training data and, ultimately, your inference data. Forcing English or German grammar into it will only degrade every other language in the tokenizer, and we already know that limiting additional languages hurts overall model performance.

And the belief that you can hand-craft a more efficient vocabulary for a dataset of trillions of tokens than a machine can is kind of weird, tbh. People have accepted since the early convnet days that the best encoding representation for images in machine learning is not a human-understandable one. The same goes for audio. So why should text be any different? If you really think so, you might as well have a go at feature engineering for images. It's not like people haven't tried that; they all eventually learned their lesson.
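
The "sequence length vs. vocabulary size" tradeoff above can be illustrated with a toy BPE-style merge loop (this is an illustrative sketch, not any production tokenizer; the corpus and merge count are made up): each merge adds one entry to the vocabulary and shortens the encoded sequence.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

corpus = "low lower lowest"
tokens = list(corpus)           # start from raw characters
for _ in range(5):              # five merges: vocabulary grows by five
    tokens = merge(tokens, most_frequent_pair(tokens))
print(len(corpus), "chars ->", len(tokens), "tokens")
```

Every extra vocabulary entry buys a shorter sequence; a real tokenizer just runs this loop tens of thousands of times over a large corpus.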

  • We usually build the tokenizer by optimizing for one goal (space-efficient encoding of text), then use it in a model that is trained for an entirely different goal (producing good text, "reasoning", "coding", etc.). It is not immediately clear that the optimization goal for the tokenizer is actually the one that best serves the training of the LLM.

    That's what all these attempts boil down to. They don't presume to be able to find a more space-efficient encoding by hand; they assume that the optimization goal for the tokenizer was wrong and that they can do better by adding some extra rules. And this isn't entirely without precedent: most tokenizers have a couple of "forced" tokens that were not organically discovered, and changing how digits are grouped in the tokenizer is another point where wins have been shown.

    This is where projects like nanochat are really valuable for quickly and (relatively) cheaply trying out various tweaks.
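
The digit-grouping tweak mentioned above is usually implemented as a pre-tokenization rule. A hypothetical sketch (the regex and the three-digit grouping are illustrative assumptions, not the rule from any specific tokenizer): force runs of digits into chunks of at most three characters before BPE sees them, so a long number can never collapse into one opaque token.

```python
import re

def pretokenize_digits(text):
    # Split text into digit chunks of length 1-3 and non-digit runs;
    # BPE merges would then operate within these chunks only.
    return re.findall(r"\d{1,3}|\D+", text)

print(pretokenize_digits("pi is 314159 roughly"))
# ['pi is ', '314', '159', ' roughly']
```

This kind of forced rule deliberately sacrifices a little compression in exchange for more uniform number representations during training.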

    • >It is not immediately clear that the optimization goal for the tokenizer is actually the one that best serves the training of the LLM.

      Except that is exactly what research has shown. Besides, the tokenizer's training goal is literally just to encode text efficiently in fewer tokens by increasing the vocabulary, which directly benefits the attention mechanism if you look at the dimensions of the matrices involved. The biggest issues so far have stemmed from mismatches between tokenizer and LLM training sets [1] and from the fact that people primarily work with character-based rather than word-part-based text when writing (even though that gets muddy when you look at what is actually happening in the brain).

      [1] https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
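
The "dimensions of the matrices involved" point can be made concrete with back-of-the-envelope arithmetic (the numbers are illustrative assumptions, not measurements): the attention score matrix is sequence_length x sequence_length per head, so a tokenizer that shortens the sequence 4x shrinks that matrix 16x.

```python
def attn_matrix_entries(seq_len, n_heads=1):
    # Entries in the QK^T attention score matrix per layer.
    return n_heads * seq_len * seq_len

chars = 4096                  # raw character stream
bpe_tokens = chars // 4       # assuming ~4 characters per BPE token
ratio = attn_matrix_entries(chars) / attn_matrix_entries(bpe_tokens)
print(ratio)                  # 16.0
```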

  • If you want to make it more human-explainable, then ditch the entire tokenizer and just feed the models raw characters. Because now there is nothing to explain.

    • Then that means you need at least 4x the compute to achieve the same results as the state of the art. If I can train my frontier model with a normal tokenizer in 3 months, it will take you a year. When major releases across competing providers are measured in months, there's simply no incentive to do that just to capture these fringe edge cases.
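
The rough arithmetic behind a "4x" estimate like the one above, assuming ~4 characters per BPE token (an illustrative average, not a measured figure): raw characters multiply sequence length by ~4, the dominant per-position costs (projections and MLPs) scale linearly with that, and the attention-score term scales quadratically.

```python
chars_per_bpe_token = 4                 # assumed average compression
seq_multiplier = chars_per_bpe_token    # ~4x more positions per document
linear_cost = seq_multiplier            # ~4x for MLPs and projections
attn_cost = seq_multiplier ** 2         # ~16x for the QK^T score matrix
print(linear_cost, attn_cost)           # 4 16
```

Since the linear term dominates at typical model sizes, ~4x is the floor, with the quadratic attention term pushing the real cost higher at long contexts.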