Comment by docmechanic

7 hours ago

Pretty sure that we’re talking apples and oranges. Yes, tokenizers can use arbitrary byte sequences, but that is not the topic of discussion. The question is whether the tokenizer will come up with words not in the training vocabulary. Word tokenizers don’t, but character tokenizers do.

Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.

“If you use word tokens: … will never be able to predict words outside of the training vocabulary.”

"If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."