Comment by koito17
12 days ago
How does this site tokenize text? Split on ASCII whitespace?
Inputting Japanese sentences of any length flags the whole sentence as "Dramatic Fragment: A standalone paragraph with ≤4 words".
That's right: its tokenization and fragment rules use fairly simple heuristics that assume whitespace-delimited words plus English punctuation. Proper CJK support would require language-specific tokenization and morphological parsing, and correcting rules like "≤4 words = dramatic fragment" would be difficult. The more complex rules already require LLM round trips, so supporting all languages in one pass would, I imagine, have to rely on the LLM alone.
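To make the failure mode concrete, here is a minimal sketch of the kind of whitespace-based rule described above (the site's actual implementation isn't shown in this thread, so the function names and the exact check are assumptions):

```python
# Hedged sketch: a naive whitespace tokenizer of the sort the rule
# "Dramatic Fragment: a standalone paragraph with <=4 words" implies.
def word_count(paragraph: str) -> int:
    # str.split() with no argument splits on runs of whitespace.
    return len(paragraph.split())

def looks_like_dramatic_fragment(paragraph: str) -> bool:
    # Assumed rule: a non-empty standalone paragraph with <= 4 "words".
    return 0 < word_count(paragraph) <= 4

english = "This is a perfectly ordinary sentence of reasonable length."
japanese = "これは十分に長い、ごく普通の日本語の文章ですが、空白がありません。"

print(word_count(english))                     # 9
print(word_count(japanese))                    # 1 -- no spaces, one "word"
print(looks_like_dramatic_fragment(japanese))  # True -> false positive
```

Because Japanese doesn't delimit words with spaces, any sentence counts as one "word" and trips the fragment rule, exactly as reported.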
Which brings up an interesting point - what do these LLM clichés look like in Japanese?
https://github.com/awnist/slop-cop#client-side-instant
> what do these LLM clichés look like in Japanese?
Besides text reading like a machine translation, the tell-tale signs often involve things like:
- itemized lists (I know, it's ironic that I'm using them here)
- frequent use of conjunctions
- use of demonstratives that feels redundant
- full-width colons, especially in titles
- subheadings that always end in abstract nouns
- bold text, especially at the beginning of a line
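A few of the tells above are surface-level enough to check with plain regexes; here is a hedged sketch (the signal names and the sample text are my own, and the fuzzier tells like redundant demonstratives would still need morphological analysis or an LLM, as noted earlier in the thread):

```python
import re

# U+FF1A full-width colon, common in LLM-style Japanese titles.
FULLWIDTH_COLON = re.compile(r"：")
# Markdown bold opening a line.
BOLD_LINE_START = re.compile(r"^\s*\*\*", re.MULTILINE)

def japanese_slop_signals(text: str) -> dict:
    """Count a few surface-level LLM tells in Japanese text."""
    return {
        "fullwidth_colon": bool(FULLWIDTH_COLON.search(text)),
        "bold_at_line_start": bool(BOLD_LINE_START.search(text)),
        # Itemized lists: ASCII hyphen or the Japanese bullet "・".
        "bullet_lines": sum(
            line.lstrip().startswith(("-", "・")) for line in text.splitlines()
        ),
    }

sample = "**結論：** 以下のポイントが重要です。\n・第一に\n・第二に"
print(japanese_slop_signals(sample))
# {'fullwidth_colon': True, 'bold_at_line_start': True, 'bullet_lines': 2}
```

This only scores the mechanical tells; it says nothing about whether the prose itself reads like machine translation.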
The demonstrative bit may be hard to convey, but to give you an idea: in Japanese, words that can be understood from context are routinely omitted, so spelling them out anyway can make a sentence sound redundant.
Before LLMs were widespread, SEO spam on the Japanese web tended to be affiliate sites with predictable, templated paragraphs. I'm reminded of those sites whenever GPT starts a response with 「結論から言うと、〇〇」 ("To start with the conclusion, ..."), since that's exactly how those affiliate sites wrote back in the day.