Comment by isaacremuant
1 day ago
It's not lazy and wrong. It's a fantastic indicator.
> If humans didn’t use them, they wouldn’t be in the LLM training data.
Humans weren't using them in every context as they are now. They might've been used in books but blog posts and work documents weren't full of them.
It's not a definite thing but it's absolutely a good indicator.
Blog posts, news articles, and other web texts have been using correct punctuation marks for a long time. I know because I’ve been noticing misuses (usually swapped or repeated quote characters) for over a decade.
Plenty of people care about typographic punctuation, and others use software (such as Apple’s OSs, markdown converters, publishing and editing tools) that auto-converts plain punctuation to smart punctuation. Heck, tools for doing that are older than Markdown, and Markdown is already two decades old.
https://daringfireball.net/projects/smartypants/
Look, nowhere have I said using an em-dash can’t be an indicator; my objection is to people using it as the indicator. It’s become a meme. Too many people act as if the existence of a single em-dash immediately and conclusively proves a text was written by an LLM. It does not.
They may be overrepresented in the RLHF