Comment by derefr
6 hours ago
I would point out that the default BPE tokenization vocabulary used by many models (cl100k_base) is already a pretty powerful shorthand. It has a lot of short tokens, sure. But then:
Token ID 73700 is the literal entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem": the model sees one opaque token, not the individual letters inside it.)
Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)
Token ID 44078 is " UnsupportedOperationException"!
Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary).
You'd be surprised how well this vocabulary can compress English prose — especially prose interspersed with code!
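If you want to check this yourself, here is a minimal sketch, assuming the third-party `tiktoken` package is installed and can fetch the cl100k_base vocabulary. The helper names `chars_per_token` and `load_cl100k_encoder` are made up for illustration, and the sample strings are just the ones mentioned above. It prints the token IDs for each sample along with a rough characters-per-token figure, so you can compare against the IDs quoted here rather than take them on faith.

```python
from typing import Callable, Optional, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of characters covered by each token (higher = denser)."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("encoder produced no tokens")
    return len(text) / len(tokens)


def load_cl100k_encoder() -> Optional[Callable[[str], Sequence[int]]]:
    """Return the cl100k_base encode function, or None if it cannot be loaded.

    Assumption: the third-party `tiktoken` package is available. The first
    call may need network access to download the vocabulary file.
    """
    try:
        import tiktoken

        return tiktoken.get_encoding("cl100k_base").encode
    except Exception:
        return None


if __name__ == "__main__":
    encode = load_cl100k_encoder()
    if encode is None:
        print("tiktoken (or its vocabulary file) is not available; skipping")
    else:
        samples = (
            " strawberry",
            " cryptocurrency",
            " UnsupportedOperationException",
            "You'd be surprised how well this vocabulary can compress English prose.",
        )
        for sample in samples:
            ids = encode(sample)
            ratio = chars_per_token(sample, encode)
            print(f"{sample!r}: {len(ids)} token(s) {list(ids)}, {ratio:.1f} chars/token")
```

`chars_per_token` takes any encode callable, so the same measurement works for comparing other vocabularies against cl100k_base.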