Comment by kmeisthax

1 year ago

If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.

0 comments