Comment by mrguyorama

2 days ago

I would bet like a dollar that the supposed em-dash usage (which I'm not convinced is an accurate take in the first place) would have come from an enterprising dev somewhere being like "Well, we probably don't need multiple tokens for hyphens" and coercing every dash-type thing to just one hyphen-like token.

But I'm also showing off my ignorance of how these machines turn text into tokens in practice.

I think all the em-dashes came from scraping WordPress blogs. The WordPress editor does "typography", and the em-dashes it introduces survive the HTML-to-Markdown process used to scrape them and end up in datasets.

EDIT: Also PDFs authored in MS Word.

If that were true, it would mean that it couldn't output hyphenated words without turning the hyphens into em dashes.
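To make that concrete, here is a rough sketch of what "coerce every dash to one token" would look like as a preprocessing step. This is purely hypothetical: `DASH_LIKE` and `collapse_dashes` are made-up names for illustration, and nothing here is claimed to be what any real tokenizer or training pipeline does.

```python
# Hypothetical normalization: map every dash-like character to a plain "-".
# The character list is an assumption chosen for illustration only.
DASH_LIKE = {
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}
DASH_TABLE = str.maketrans(DASH_LIKE)


def collapse_dashes(text: str) -> str:
    """Replace every dash-like character with an ASCII hyphen-minus."""
    return text.translate(DASH_TABLE)


hyphenated = "a well-known idea"
with_em_dash = "a well\u2014known idea"

# After collapsing, the two strings are identical, so anything downstream
# could no longer tell a hyphen from an em dash.
assert collapse_dashes(hyphenated) == collapse_dashes(with_em_dash)
```

Once the two are merged like this, a model trained on the result would have one symbol for both, so it could not emit hyphens and em dashes as distinct characters.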

Two dashes is still a token. You would only be correct if LLMs were still thinking at the level of characters.