Comment by latexr
2 days ago
Can we please stop using the em-dash as a metric to “detect” LLM writing? It’s lazy and wrong. Plenty of people use em-dashes; it’s a useful punctuation mark. If humans didn’t use them, they wouldn’t be in the LLM training data.
There are better clues, like the kind of vague pretentious babble bad marketers use to make their products and ideas seem more profound than they are. It’s a type of bad writing which looks grandiose but is ultimately meaningless and that LLMs heavily pick up on.
Very few people use en dashes in internet writing, as opposed to hyphens, because they are not available on the default keyboard.
This is a post with formatting and we're programmers here. I can assure you their editor (or Markdown) supports the em-dash in some fashion.
That’s not true at all. Apple’s OSes have smart punctuation enabled by default and convert -- (two hyphens) into — (“em-dash”; not an “en-dash”, which has a different purpose), " " (dumb quotes) into “ ” (smart quotes), and so forth.
Furthermore, on macOS there are simple key combinations (e.g. with ⌥) to make all sorts of smart punctuation even if you don’t have the feature enabled by default, and on iOS you can long-press a key (such as the hyphen) to see alternates.
The majority of people may not use correct punctuation marks, but enough do that assuming a single character immediately means they used an LLM is just plain wrong. I have never used an LLM to write a blog post, internet comment, or anything of the sort, and I have used smart punctuation in all my writing for over a decade. Same with plenty of other HN commenters, journalists, writers, editors, and on and on. You don’t need to be a literal machine to care about correct character use.
So we’ve established the default is a hyphen, not an em dash.
You can certainly select an em dash but most don’t know what it means and don’t use it.
It’s certainly not infallible proof but multiple uses of it in comments online (vs published material or newspapers) are very unusual, so I think it’s an interesting indicator. I completely agree it is common in some texts, usually ones from publishing houses with style guides but also people who know about writing or typography.
> assuming a single character immediately means they used an LLM is just plain wrong
I don't see anyone doing that here. LLM writing was brought up because of the writing style, not the dash. It just reinforces the suspicion.
On the “default keyboard” of most people (a phone), you just long-press hyphen to choose any dash length.
But who does? Not many.
It's not a guarantee, but it does make it so much more likely. Therefore, it is an extremely useful prior to hold.
It's not lazy and wrong. It's a fantastic indicator.
> If humans didn’t use them, they wouldn’t be in the LLM training data.
Humans weren't using them in every context as they are now. They might've been used in books but blog posts and work documents weren't full of them.
It's not a definite thing but it's absolutely a good indicator.
Blog posts, news articles, and other web texts have been using correct punctuation marks for a long time. I know because I’ve been noticing misuses (usually swapped or repeated quote characters) for over a decade.
Plenty of people care about typographic punctuation, and others use software (such as Apple’s OSs, Markdown converters, publishing and editing tools) which auto-converts smart punctuation. Heck, tools for doing that are older than Markdown, and that is already two decades old.
https://daringfireball.net/projects/smartypants/
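To make the auto-conversion concrete, here is a minimal sketch of the kind of substitution such tools perform (two hyphens into an em-dash, straight quotes into curly ones). It is a rough illustration only, not SmartyPants’ actual algorithm; the function name `smarten` and the rules for deciding which quote is an opening one are assumptions made for this example.

```python
import re

def smarten(text: str) -> str:
    """Rough smart-punctuation pass (hypothetical helper, not SmartyPants)."""
    # Two consecutive hyphens become an em-dash (U+2014).
    text = text.replace("--", "\u2014")
    # A double quote at the start of the text, or right after whitespace
    # or an opening bracket, is treated as an opening quote (U+201C).
    text = re.sub(r'(^|[\s(\[{])"', "\\1\u201c", text)
    # Every remaining double quote is treated as a closing quote (U+201D).
    text = text.replace('"', "\u201d")
    # Same idea for single quotes; the leftover case also covers apostrophes.
    text = re.sub(r"(^|[\s(\[{])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text

converted = smarten('He said "wait" -- twice, and I don\'t mind.')
print(converted)
```

The point is only that a few lines of plain text substitution are enough to put an em-dash into a comment without the author ever typing one; real tools handle far more edge cases (nested quotes, code spans, HTML tags, and so on).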
Look, nowhere have I said using an em-dash can’t be an indicator; my objection is to people using it as the indicator. It’s become a meme. Too many people act as if the existence of a single em-dash immediately and conclusively proves it was written by an LLM. It does not.
They may be overrepresented in the RLHF