Comment by 2snakes

1 year ago

I read one characterization: LLMs don't give new information (except to the user, who is learning), but they reorganize old information.


barrenko 1 year ago

Custodians of human knowledge.

docmechanic 1 year ago

That’s only true if you tokenize words rather than characters. Character tokenization generates new content outside the training vocabulary.

selfhoster11 1 year ago
All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for 0x00 to 0xFF, and you can encode any novel UTF-8 words or structures with it, including emoji and characters that weren't part of the model's initial training, if you show it some examples.
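
To make the byte-range point concrete, here is a minimal sketch, assuming a hypothetical layout in which token IDs 0 to 255 stand directly for the raw bytes 0x00 to 0xFF. Real tokenisers differ in the details (byte-level BPE remaps bytes to printable characters and then merges them; SentencePiece exposes bytes as fallback pieces), so the function names and ID layout below are illustrative only.

```python
# Hypothetical layout: token IDs 0..255 are the raw byte values 0x00..0xFF.
# This is an illustration of byte fallback, not any specific tokeniser's API.

def encode_bytes(text: str) -> list[int]:
    """Encode any string as a sequence of byte-valued token IDs."""
    return list(text.encode("utf-8"))

def decode_bytes(token_ids: list[int]) -> str:
    """Invert encode_bytes by reassembling the UTF-8 byte string."""
    return bytes(token_ids).decode("utf-8")

# A made-up word plus an emoji: neither needs to be in any learned vocabulary.
novel = "glorptastic 🦑"
ids = encode_bytes(novel)

# Every ID fits in the reserved 256-entry range, and the round trip is lossless.
in_range = all(0 <= i <= 255 for i in ids)
round_trip = decode_bytes(ids)
```

Whether the model assigns sensible probabilities to such byte sequences is a separate question from whether the tokeniser can represent them.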
- docmechanic 1 year ago
  
  Pretty sure that we’re talking apples and oranges. Yes to the arbitrary byte sequences used by tokenizers, but that is not the topic of discussion. The question is will the tokenizer come up with words not in the training vocabulary. Word tokenizers don’t, but character tokenizers do.
  Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.
  “If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”
  "If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."
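
The distinction in the two quotes can be sketched as a reachability check on a toy corpus. This is an assumption-laden illustration, not the book's code: `word_model_can_emit` and `char_model_can_emit` are hypothetical helpers that only ask whether an output is expressible in the vocabulary, not whether a trained model would actually produce it.

```python
# Toy corpus standing in for "the training data".
corpus = "the cat sat on the mat"

word_vocab = set(corpus.split())  # word-level vocabulary
char_vocab = set(corpus)          # character-level vocabulary (includes space)

def word_model_can_emit(word: str) -> bool:
    """A word-token model can only output whole words it has in its vocabulary."""
    return word in word_vocab

def char_model_can_emit(word: str) -> bool:
    """A character-token model can spell anything built from known characters."""
    return all(ch in char_vocab for ch in word)

# "moth" never appears in the corpus, but m, o, t, h all do.
unseen_word = "moth"
# "zebra" contains characters the corpus never used.
unseen_chars = "zebra"
```

Note the limit in the last case: a character vocabulary only moves the boundary from unseen words to unseen characters.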
  
asdff 1 year ago

Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.
emaro 1 year ago
Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.
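
A rough way to see the cost side: the same sentence becomes a much longer sequence under character tokens. The sketch below uses whitespace splitting as a stand-in for a word tokeniser (an assumption), and the quadratic figure assumes standard self-attention, whose cost grows roughly with the square of sequence length.

```python
sentence = "character based tokens make semantics harder to learn"

word_tokens = sentence.split()  # whitespace split as a stand-in word tokeniser
char_tokens = list(sentence)    # one token per character, spaces included

# How many times longer the character sequence is.
length_ratio = len(char_tokens) / len(word_tokens)

# Standard self-attention compares every position with every other position,
# so the relative attention cost scales with the square of the length ratio.
attention_cost_ratio = length_ratio ** 2

print(len(word_tokens), len(char_tokens), round(attention_cost_ratio, 1))
```

On top of the longer sequence, each character carries little meaning by itself, so the model has to learn to assemble words before it can learn what they mean.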
- docmechanic 1 year ago
  
  Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.
  “If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”
  "If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."