Comment by Hendrikto
1 year ago
No, actually far fewer tokens. 256 tokens cover all bytes. See the ByT5 paper: https://arxiv.org/abs/2105.13626
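A minimal sketch of the idea, assuming each token id is just the raw UTF-8 byte value (real byte-level tokenizers like ByT5's shift ids to make room for special tokens, but the vocabulary-size point is the same):

```python
# Byte-level tokenization: every UTF-8 byte maps directly to one of
# 256 token ids, so no learned subword vocabulary is needed.
def byte_tokenize(text: str) -> list[int]:
    return list(text.encode("utf-8"))  # one token per byte

def byte_detokenize(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("héllo")
print(ids)                   # [104, 195, 169, 108, 108, 111]
print(max(ids) < 256)        # True: 256 ids cover every possible byte
print(byte_detokenize(ids))  # héllo
```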
More tokens per sequence, though. And since it is learning sequences...
Yeah, suddenly 16k tokens is just 16 KB of ASCII instead of ~6k words
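A back-of-the-envelope sketch of that trade-off. The chars-per-word and subword-tokens-per-word ratios below are assumptions chosen to match the "~6k words" estimate above; real subword tokenizers vary by model and language:

```python
# How much text fits in a 16k-token context window under byte-level
# vs. subword tokenization.
CONTEXT = 16_384               # tokens in the context window
CHARS_PER_WORD = 6             # ~5 letters + a space, rough English average
SUBWORD_TOKENS_PER_WORD = 2.7  # assumption matching the "~6k words" figure

# Byte-level: one token per ASCII byte, i.e. one token per character.
ascii_chars = CONTEXT
print(f"byte-level: {ascii_chars:,} chars ≈ {ascii_chars // CHARS_PER_WORD:,} words")

# Subword: several bytes of text per token on average.
print(f"subword:    ≈ {int(CONTEXT / SUBWORD_TOKENS_PER_WORD):,} words")

# byte-level: 16,384 chars ≈ 2,730 words
# subword:    ≈ 6,068 words
```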