Comment by WhitneyLand

1 year ago

Where did you get this idea from?

Tokens aren’t the source of facts within a model. Tokenization is an implementation detail and doesn’t inherently constrain how things could be counted.

Tokens are the first form in which information gets encoded into the model. They're statistically derived, more or less a compression dictionary comparable to a Lempel-Ziv setup.

Combinations of tokens get encoded, so if a feature isn't part of the information carried forward into the network as it models the corpus, that feature is modeled poorly or not at all. The consequence of tokens that span many characters is that the identity of individual characters is lost, and you have to explicitly elicit that information. Models know that words are made of individual characters, but "strawberry" isn't encoded as a sequence of letters; it's encoded as an individual feature of the tokenizer embedding.
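
As a minimal illustration of that point, assuming the tiktoken package and an OpenAI-style BPE vocabulary (other tokenizers split words differently):

    # Sketch: inspect how a BPE tokenizer splits a word into sub-word tokens.
    # Assumes the `tiktoken` package is installed; splits vary by tokenizer.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    ids = enc.encode("strawberry")
    pieces = [enc.decode([i]) for i in ids]

    print(ids)     # a handful of integer token IDs, not one ID per letter
    print(pieces)  # sub-word chunks, e.g. something like ['str', 'aw', 'berry']

The model receives those few IDs as atomic features; the individual letters are only implicit in them.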

Other forms of tokenizing have other tradeoffs. The recent trend is to increase tokenizer vocabulary size, up to 128k in Llama 3 from roughly 50k in GPT-3. The larger the vocabulary, the more nuanced the individual embedding features in that layer can be before downstream modeling.
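
The vocabulary sizes are easy to check, at least for the OpenAI encodings; this sketch assumes a recent tiktoken release (the Llama 3 tokenizer ships with the model rather than with tiktoken, so its 128k vocabulary isn't listed here):

    # Sketch: print the vocabulary sizes of a few published BPE encodings.
    import tiktoken

    for name in ("gpt2", "cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        print(name, enc.n_vocab)
    # Expect roughly 50k, 100k, and 200k entries respectively.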

Tokens inherently constrain how the notion of individual letters is modeled in the context of everything an LLM learns. In the vast majority of cases, the letters don't matter, so those features don't get mapped and carried downstream of the tokenizer.

  • So, it's conjecture then?

    What you're saying sounds plausible, but I don't see how we can conclude that definitively without at least some empirical tests, say a set of words that predictably give an error along token boundaries (a first step toward such a test set is sketched after this reply).

    The thing is, there are many ways a model can arrive at an answer to the same question; it doesn't just depend on the architecture but also on how the training data is structured.

    For example, if it turned out tokenization was the cause of this glitch, conceivably it could be fixed by adding enough documents with data relating to letter counts, providing another path to get the right output.
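
    A first step toward that kind of test might look like the sketch below. It only builds the ground-truth side (token splits plus true letter counts for an arbitrary word list); a full test would also query a model and compare its answers across words whose splits differ. It assumes the tiktoken package as a stand-in tokenizer.

        # Sketch: ground-truth data for a letter-counting probe.
        # The word list and target letter are arbitrary choices for illustration.
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        words = ["strawberry", "raspberry", "blueberry", "mississippi"]
        letter = "r"

        for w in words:
            pieces = [enc.decode([i]) for i in enc.encode(w)]
            true_count = w.count(letter)
            # A model's answer for each word would be compared against
            # true_count, grouping words by how `letter` falls across
            # token boundaries.
            print(w, pieces, letter, true_count)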

There aren't many places in the training data that teach the AI which letters are in each token. It's a concept the model has to construct on its own, and with so little information about it in the dataset, it has difficulty generalizing it.

There are a lot of problems like that which can be reformulated. For example, if you ask which is bigger between 9.11 and 9.9, it will often incorrectly answer 9.11. If you look at how the numbers are tokenized, you can see that an easy problem gets restated as something not straightforward even for a human. If you restate the problem by writing the numbers out in words, it will respond correctly.
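
A minimal sketch of that observation, again assuming tiktoken and an OpenAI-style vocabulary (exact splits differ between tokenizers):

    # Sketch: compare how decimal strings and spelled-out numbers tokenize.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ("9.11", "9.9", "nine point eleven", "nine point nine"):
        print(s, "->", [enc.decode([i]) for i in enc.encode(s)])
    # The decimals come back as chunks such as ['9', '.', '11'], so the
    # comparison the model sees isn't the one a human reads; the spelled-out
    # versions map onto ordinary word tokens.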