Comment by akdor1154

11 hours ago

I have often wondered if Chinese is a much 'better' language for LLMs - every character is a token, boom, you're done. No weird subword nonsense, no strange semantics being applied to arbitrary chunks of words... I feel like there must be benefits to having the language tokenized at what must be very close to 1:1.

Yes, it is. In fact, I made a small application to reduce token consumption when translating from one language to another, and I even invented a language called Tokinensis, a mix of different languages. In my own tests it saved about 30% of tokens. Chinese is amazing because it encapsulates a ton of information in a single symbol, so you can save a lot of tokens.
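The "1:1" intuition above can be sketched with a toy greedy tokenizer. The subword vocabulary below is entirely made up for illustration; real tokenizers (BPE variants such as tiktoken's encodings) learn their merges from data, but the effect is similar: English words split into several subword pieces, while characters outside the vocabulary fall through at one token each.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for length in range(min(len(text) - i, 10), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Not in the vocabulary: emit one token per character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical subword vocabulary for the English side.
vocab = {"trans", "lat", "ion", "token", "iz", "er", " "}

print(greedy_tokenize("translation", vocab))  # ['trans', 'lat', 'ion']
print(greedy_tokenize("翻译", vocab))          # ['翻', '译'] - one token per character
```

Of course, production tokenizers do have multi-character Chinese tokens and multi-byte encodings complicate the picture, so real ratios vary; this only illustrates the subword-versus-character contrast.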

  • Are you able to use the language practically? How would that work? You prompt it in English but tell it to work in Tokinensis, and then translate back at the end?

  • Interesting; I came across a post that mentioned using Kanji specifically to reduce context size.

    • Maybe in the future there will be some "Tokinensis" in kanji, which could concentrate a lot of information into very little space.

I'm not sure what the state of the art is today, but 15 years ago I worked on a cross-lingual search engine - a challenge with Chinese was that ngram-like models for detecting common language errors (such as typos) were simply ineffective, precisely because of this character-level structure.

We found a lot of gain by having ranking features based on Pinyin to detect typos/misspellings due to homophones (and similar sounding words). I was investigating stroke decomposition to try to be able to detect near homographs, but wasn't able to find any good libraries at the time.

I could imagine the homophone issue is especially relevant for spoken input to LLMs. LLMs are good enough that they're usually right, so it's probably less of an issue now, but in English I can make crazy typos and everything just works. I'm curious how well that works for Chinese; I suspect it's a far harder problem due to the lack of subword tokens.