Comment by phire

16 hours ago

The tokeniser is not a dictionary. It doesn't provide definitions, or give the LLM any kind of mapping from tokens to meanings.

At best, it's a wordlist. It gives the LLM some idea of what humans consider to be common words, but it doesn't tell the LLM anything at all about those words. And it's not even comprehensive: many words map to multiple tokens. Nor is it exclusively words: some of those tokens are punctuation, or modifiers, or control tokens. On multimodal LLMs, some of the tokens actually represent image and audio data.

The LLM doesn't get informed about any of this up front; it has to learn what every single token means from context.
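
To make the "wordlist" point concrete, here is a minimal sketch. The vocabulary and the greedy longest-match rule are made up for illustration (real tokenisers typically use learned merges over a much larger vocabulary), but the shape is the same: strings in, integer IDs out, and nothing about meaning.

```python
# Hypothetical toy vocabulary: the tokeniser knows strings and their IDs, nothing else.
VOCAB = ["<eos>", " ", ",", "the", "token", "iser", "un", "believ", "able", "cat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def tokenise(text):
    """Split text into token IDs using greedy longest-match against VOCAB."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


ids = tokenise("the unbelievable tokeniser")
print(ids)  # [3, 1, 6, 7, 8, 1, 4, 5]
```

"unbelievable" comes out as three IDs and "tokeniser" as two. The model only ever sees the integers; that ID 9 has anything to do with cats is something it has to pick up from training data.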

You are technically right that it's something in an LLM that's not weights, but it's not that structured. And really it's only there so the LLM can interact with the outside world.

> There are grammar rules

There is no dedicated "grammar rule" structure in the LLM or the tokeniser. The LLM has to learn them all from context, and they get encoded as part of the 80 layers of weights.


ozgung 16 hours ago

I see people give too much importance to specific engineering design choices of the current generation of LLMs. The tokenizer is not an absolutely essential part of the system. It's just an adapter for text input/output. It can be eliminated completely, and the model can use bytes directly.

I think the short story captures this well. Weights (connections) are the essential and philosophically important part. They do the thinking, memory, singing etc.
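
A minimal sketch of what "use bytes directly" could look like on the input/output side, assuming UTF-8 text. The function names are hypothetical; the point is only that the "vocabulary" collapses to the 256 possible byte values, so no wordlist is needed.

```python
def bytes_to_ids(text):
    """Encode text as a list of integers in 0..255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def ids_to_text(ids):
    """Inverse of bytes_to_ids: rebuild the string from byte values."""
    return bytes(ids).decode("utf-8")


ids = bytes_to_ids("hi")
print(ids)  # [104, 105]
print(ids_to_text(bytes_to_ids("naïve")))  # naïve
```

The trade-off is sequence length: every character costs at least one position, and non-ASCII characters cost more ("ï" takes two bytes here).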

yencabulator 10 hours ago
A tokenizer is roughly Huffman-coding sequences of input (bytes of English, etc.) into shorter sequences (lists of tokens), as a performance optimization.
As you said, it's not in any way intrinsic to the LLM, though it may be a very necessary optimization on today's hardware.
- phire 9 hours ago
  
  I wouldn't use the word necessary.
  IMO, we are probably talking about a 6x slowdown (for typical English). You would need to be absolutely stupid not to implement some kind of optimisation along these lines.
  Slower and maybe a little dumber, but it would work.
  
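
The compression idea in this sub-thread can be sketched with byte-pair merging, which is one common way tokenisers are built (it is not literally Huffman coding, but the goal of shortening the sequence is similar). Everything below is a toy under stated assumptions: the sample text, the merge count, and the function names are made up, and the ratio it produces on such a tiny repetitive input says nothing about the real figure for typical English.

```python
from collections import Counter


def train_merges(data, num_merges):
    """Learn byte-pair merges; return the merge list and the shortened sequence."""
    seq = list(data)   # start from raw byte values 0..255
    merges = []
    next_id = 256      # new token IDs start just past the byte range
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break      # nothing repeats, so merging no longer helps
        merged = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(next_id)  # replace the pair with one new token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        merges.append((pair, next_id))
        next_id += 1
    return merges, seq


def decode(tokens, merges):
    """Expand merged tokens back into the original bytes."""
    table = {new_id: pair for pair, new_id in merges}
    out = []
    stack = list(reversed(tokens))
    while stack:
        t = stack.pop()
        if t in table:
            a, b = table[t]
            stack.append(b)  # push b first so a is expanded first
            stack.append(a)
        else:
            out.append(t)
    return bytes(out)


text = b"the cat sat on the mat, the cat sat on the hat"
merges, tokens = train_merges(text, num_merges=20)
print(len(text), "bytes ->", len(tokens), "tokens")
print("ratio:", len(text) / len(tokens))
```

The model's per-step cost is paid once per sequence position, so fewer positions for the same text means less work. Skipping the merges and feeding raw bytes still works; it is just longer.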