Comment by koito17

12 days ago

How does this site tokenize text? Split on ASCII whitespace?

Inputting Japanese sentences of any length flags the whole sentence as "Dramatic Fragment: A standalone paragraph with ≤4 words".

That's right: its tokenization and fragment rules use fairly simple heuristics that assume whitespace-delimited words plus English punctuation conventions. Proper CJK support would require language-specific tokenization and morphological parsing, so correcting rules like "≤4 words = dramatic fragment" would be difficult. The more complex rules already require LLM round trips, so supporting all languages in one pass would have to rely on the LLM alone, I imagine.
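To make the failure mode concrete, here is a minimal sketch (an assumption about how such a rule might look, not slop-cop's actual code) of a whitespace-based "dramatic fragment" check and why it misfires on Japanese:

```python
def word_count(text: str) -> int:
    # str.split() with no argument splits on runs of whitespace.
    # Fine for English; wrong for Japanese, which puts no spaces
    # between words, so a whole sentence counts as one "word".
    return len(text.split())

def is_dramatic_fragment(paragraph: str) -> bool:
    # Hypothetical rule: flag a standalone paragraph with <= 4 "words".
    return word_count(paragraph) <= 4

# English: behaves as intended.
print(is_dramatic_fragment("Short. Punchy. Dramatic."))  # True (3 words)

# Japanese: an ordinary two-sentence paragraph is one whitespace token,
# so it is always flagged, regardless of length.
print(is_dramatic_fragment("吾輩は猫である。名前はまだ無い。"))  # True (false positive)
```

A language-aware version would need real word segmentation (e.g. a morphological analyzer for Japanese) before any word-count rule can apply.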

Which brings up an interesting point - what do these LLM clichés look like in Japanese?

https://github.com/awnist/slop-cop#client-side-instant

  • > what do these LLM clichés look like in Japanese?

    Besides text reading like a machine translation, the tell-tale signs often involve things like:

    - itemized lists (I know, it's ironic that I'm using them here)

    - frequent use of conjunctions

    - use of demonstratives that feels redundant

    - full-width colons, especially in titles

    - subheadings that always end in abstract nouns

    - bold text, especially at the beginning of a line

    The demonstrative bit may be hard to express, but to give you an idea: when communicating in Japanese, words that can be understood from context may be omitted. Explicitly writing out words understood from context can sometimes make a sentence sound redundant.

    Before LLMs were widespread, SEO spam on the Japanese web tended to be affiliate sites with predictable, templated paragraphs. I get reminded of those sites whenever GPT starts a response with 「結論から言うと、〇〇」 ("to start with the conclusion, 〇〇"), since that's exactly how those affiliate sites wrote back in the day.