Comment by awnist

11 days ago

That's right: its tokenization and fragment rules use fairly simple heuristics that assume whitespace-delimited words and English punctuation. Proper CJK support would require language-specific tokenization and morphological parsing. Rules like "≤4 words = dramatic fragment" would be difficult to adapt. The more complex rules already require LLM round trips, so supporting all languages in one pass would have to rely on the LLM alone, I imagine.
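To illustrate why the heuristic breaks down, here is a minimal sketch of a whitespace-based word count and a "≤4 words = dramatic fragment" check (a hypothetical reconstruction, not slop-cop's actual code). Japanese text contains no spaces between words, so the naive tokenizer sees an entire sentence as a single "word" and the rule misfires:

```python
def word_count(sentence: str) -> int:
    # Naive tokenization: split on whitespace. Works tolerably for
    # English, but CJK text has no whitespace between words.
    return len(sentence.split())

def is_dramatic_fragment(sentence: str) -> bool:
    # Hypothetical version of the "≤4 words = dramatic fragment" rule.
    return word_count(sentence) <= 4

# English fragment: 4 whitespace-delimited tokens, flagged as intended.
print(is_dramatic_fragment("Not just fast. Blazing."))

# A full Japanese sentence: no spaces, so it counts as one "word"
# and gets flagged as a fragment even though it isn't one.
print(is_dramatic_fragment("これは断片ではなく完全な文です。"))
```

Doing this properly for Japanese would mean running a morphological analyzer (e.g. MeCab) just to get a word count, which is the language-specific machinery mentioned above.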

Which brings up an interesting point - what do these LLM clichés look like in Japanese?

https://github.com/awnist/slop-cop#client-side-instant

> what do these LLM clichés look like in Japanese?

Besides text that reads like a machine translation, the tell-tale signs often include:

- itemized lists (I know, it's ironic that I'm using them here)

- frequent use of conjunctions

- use of demonstratives that feels redundant

- full-width colons, especially in titles

- subheadings that always end in abstract nouns

- bold text, especially at the beginning of a line

The demonstrative bit may be hard to convey, but to give you an idea: in Japanese, words that can be understood from context are often omitted. Explicitly writing them out anyway can make a sentence sound redundant.

Before LLMs were widespread, SEO spam on the Japanese web tended to be affiliate sites with predictable, templated paragraphs. I'm reminded of those sites whenever GPT starts a response with 「結論から言うと、〇〇」 ("to start with the conclusion, X"), since that's exactly how those affiliate sites opened back in the day.