Comment by koito17

12 days ago

How does this site tokenize text? Split on ASCII whitespace?

Inputting Japanese sentences of any length flags the whole sentence as "Dramatic Fragment: A standalone paragraph with ≤4 words".

That's right: its tokenization and fragment rules use fairly simple heuristics that assume whitespace-delimited words plus English punctuation conventions. Proper CJK support would require language-specific tokenization and morphological parsing, and correcting rules like "≤4 words = dramatic fragment" would be difficult. The more complex rules already require LLM round-trips, so supporting all languages in one pass would have to rely on the LLM alone, I imagine.
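To make the failure mode concrete, here's a minimal sketch (hypothetical code, not slop-cop's actual implementation) of a whitespace-based word counter and the "≤4 words" rule. Japanese doesn't put spaces between words, so any sentence collapses into a single "word" and trips the fragment check:

```python
def word_count(paragraph: str) -> int:
    """Count words by splitting on whitespace, as a naive tokenizer would."""
    return len(paragraph.split())

def is_dramatic_fragment(paragraph: str) -> bool:
    """Flag a standalone paragraph of <=4 whitespace-delimited 'words'."""
    return word_count(paragraph) <= 4

english = "This changes everything."
japanese = "これは日本語の文章で、単語の間に空白がありません。"  # a full sentence

print(word_count(english))             # 3
print(word_count(japanese))            # 1 -- the whole sentence is one "token"
print(is_dramatic_fragment(japanese))  # True: a false positive
```

A real fix would swap `str.split` for a language-aware segmenter (e.g. a morphological analyzer like MeCab for Japanese), which is exactly the per-language work described above.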

Which brings up an interesting point - what do these LLM clichés look like in Japanese?

https://github.com/awnist/slop-cop#client-side-instant

  • > what do these LLM clichés look like in Japanese?

    Besides text reading like a machine translation, the tell-tale signs often involve things like:

    - itemized lists (I know, it's ironic that I'm using them here)

    - frequent use of conjunctions

    - use of demonstratives that feels redundant

    - full-width colons, especially in titles

    - subheadings that always end in abstract nouns

    - bold text, especially at the beginning of a line

    The demonstrative point may be hard to convey, but to give you an idea: in Japanese, words that can be understood from context are routinely omitted. Explicitly writing them out anyway can make a sentence sound redundant.
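    A couple of the signs above are mechanical enough to check without an LLM. This is a hypothetical sketch of my own (the patterns and the sample text are illustrative, not from slop-cop), flagging full-width colons and bold text at the start of a line:

    ```python
    import re

    # U+FF1A, the full-width colon common in LLM-generated Japanese titles
    FULLWIDTH_COLON = re.compile("：")
    # Markdown bold at the beginning of a line
    LEADING_BOLD = re.compile(r"^\*\*.+?\*\*", re.MULTILINE)

    def japanese_slop_signals(text: str) -> dict:
        """Return which of the two surface-level signals appear in the text."""
        return {
            "fullwidth_colon": bool(FULLWIDTH_COLON.search(text)),
            "leading_bold": bool(LEADING_BOLD.search(text)),
        }

    sample = "**結論：** これはとても重要なポイントです。"
    print(japanese_slop_signals(sample))
    # {'fullwidth_colon': True, 'leading_bold': True}
    ```

    The harder signals on the list (redundant demonstratives, conjunction frequency, subheadings ending in abstract nouns) would need morphological analysis rather than regexes.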

    Before LLMs were widespread, SEO spam on the Japanese web tended to be affiliate sites with predictable, templated paragraphs. I'm reminded of those sites whenever GPT starts a response with 「結論から言うと、〇〇」 ("to start with the conclusion, ..."), since that's exactly how those affiliate sites wrote back in the day.