Comment by jordanb

2 days ago

The servile tone was trained into them with RLHF, with the trainers largely being low-wage workers in the global south. That's also where some of the other quirks, like the excessive em-dash usage, came from. I think it's a combination of those workers anticipating how a first-world employer would expect them to respond, and explicit instructions given to them about how the robot should be trained.

I suspect a lot of the em-dash usage also comes from transcriptions of verbal media. In speech, people often use the kind of asides that call for an em-dash once written down.

  • I would bet like a dollar that the supposed em-dash usage (which I'm not convinced is an accurate take in the first place) came from an enterprising dev somewhere being like "Well, we probably don't need multiple tokens for hyphens" and coercing every dash-type character to a single hyphen-like token.

    But I'm also showing off my ignorance of how these machines turn text into tokens in practice.

    • I think all the em-dashes came from scraping WordPress blogs. The WordPress editor applies "typography", the em-dashes it introduces survive the HTML-to-Markdown process used to scrape the posts, and they end up in datasets.

      EDIT: Also PDFs authored in MS Word.

    • If that were true, it would mean that it couldn't output hyphenated words without turning the hyphens into em dashes.

    • Two dashes is still a token. You would only be correct if LLMs were still thinking at the level of characters.
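
To make the single-token objection above concrete, here is a minimal sketch in Python. It assumes a hypothetical preprocessing step (`collapse_dashes` is an invented name for illustration, not something any real tokenizer is known to do) that coerces every dash-like character to one canonical symbol. It works character by character, which real subword tokenizers do not, so treat it as a toy.

```python
# Toy illustration only: real tokenizers operate on subword units, not characters.
# Dash-like characters: ASCII hyphen, Unicode hyphens, figure/en/em dash, minus sign.
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2212"


def collapse_dashes(text: str, canonical: str = "-") -> str:
    """Hypothetical step: replace every dash-like character with one symbol."""
    return "".join(canonical if ch in DASHES else ch for ch in text)


# \u2014 is an em dash, \u2013 is an en dash.
sample = "A well-known aside \u2014 like this one \u2014 spans pages 3\u20135."

# Case 1: everything collapses to a hyphen, so no em dash survives in the data.
as_hyphen = collapse_dashes(sample, "-")

# Case 2: everything collapses to an em dash, so "well-known" is damaged too.
as_em = collapse_dashes(sample, "\u2014")

print(as_hyphen)
print(as_em)
```

Either way the distinction is lost: with a hyphen as the canonical symbol the model would never see or emit an em dash at all, and with an em dash as the canonical symbol every hyphenated word would come out with one. Neither matches models that produce both hyphens and em dashes.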