Comment by sigmoid10

14 hours ago

>It is not immediately clear that the optimization goal for the tokenizer is actually the one that best serves the training of the llm.

Except that is exactly what research has shown. Besides, the tokenizer's training goal is literally just to encode text efficiently in fewer tokens by growing the vocabulary, which directly benefits the attention mechanism if you look at the dimensions of the matrices involved: attention cost grows with sequence length, so shorter token sequences are cheaper. The biggest issues so far have stemmed from discrepancies between the tokenizer's and the LLM's training sets [1], and from the fact that people primarily work with character-based rather than word-part-based text when doing anything in writing (even though that gets muddy when you look at what is actually happening in the brain).
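To make the compression point concrete, here is a minimal sketch of that training objective, assuming a BPE-style merge procedure (not any specific production tokenizer): each merge adds one vocabulary entry and shortens the token sequence, which shrinks the attention matrices.

```python
# Toy BPE-style merging: repeatedly replace the most frequent adjacent
# token pair with a single new token. Vocabulary grows, sequence shrinks.
from collections import Counter

def bpe_merge_steps(text, num_merges):
    tokens = list(text)          # start from individual characters
    vocab = set(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)   # new vocab entry replaces the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        vocab.add(a + b)
    return tokens, vocab

text = "low lower lowest low low"
after, vocab = bpe_merge_steps(text, 8)
# The sequence is shorter than the raw character sequence, at the cost
# of a larger vocabulary — exactly the trade the comment describes.
print(len(list(text)), len(after), len(vocab))
```

The toy text and merge count are arbitrary; the point is only the direction of the trade-off, not the merge schedule of any real tokenizer.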

[1] https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...