Comment by vjerancrnjak
3 months ago
It must be the tokenizer. Figuring out words from an image is harder (edges, shapes, letters, words, ...), yet the internal representations end up more efficient.
I always found it strange that tokens can't just be single symbols; instead there's an alphabet of hundreds of thousands of tokens, which strips the low-level information out of language (rhythm, syllables, etc.). Side effects include simple edge cases like answering "2" when asked how many r's are in "strawberry", or having no way to generate predefined rhyming patterns (without constrained sampling). There's an understandable reason for these big token dictionaries, but it feels like a hack.
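To make that concrete, here's a minimal sketch using the tiktoken library and its cl100k_base vocabulary (an assumption for illustration; any BPE tokenizer shows the same effect, and the exact split is vocabulary-dependent). The model only ever sees the chunk ids, never the letters inside them:

```python
# pip install tiktoken  -- any BPE tokenizer illustrates the same point
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # ~100k-entry vocabulary

ids = enc.encode("strawberry")
# Recover the text each token id stands for
pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]
print(ids, pieces)  # a handful of multi-letter chunks, not 10 characters

# The model receives only these opaque ids, so "how many r's?" means
# reconstructing spelling information the tokenizer already discarded.
```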