Comment by akdor1154
6 hours ago
I have often wondered if Chinese is a much 'better' language for LLMs - every character could be a token, boom you're done. No weird subword nonsense, no strange semantics applied to arbitrary chunks of words. I feel like there must be benefits to having the language tokenized at what must be very close to 1:1.
Yes, it is. In fact, I built a small application to reduce token consumption when translating between languages, and I even invented a language called Tokinensis, a mix of different languages; in my own tests I measured savings of around 30%. Chinese is great here because a single character encapsulates a lot of information, so you can save a lot of tokens.
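To give a rough feel for the density claim above, here is a minimal sketch comparing the length of an English sentence with a Chinese translation. The sentence pair and the `stats` helper are my own illustration, and note that real BPE tokenizers do not map one character to one token (a Chinese character is often split into multiple byte-level tokens), so this shows character/byte density, not actual token counts.

```python
# Compare an English sentence with a Chinese translation of it.
# Character counts illustrate information density per symbol;
# UTF-8 byte counts matter for byte-level BPE tokenizers, where
# each Chinese character occupies 3 bytes.

english = "Thank you very much for your help."
chinese = "非常感谢你的帮助。"  # rough translation of the same sentence

def stats(text: str) -> dict:
    """Return character and UTF-8 byte counts for a string."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }

print(stats(english))  # {'characters': 34, 'utf8_bytes': 34}
print(stats(chinese))  # {'characters': 9, 'utf8_bytes': 27}
```

The Chinese version carries the same meaning in far fewer characters, which is where the intuition about token savings comes from; whether that translates into actual savings depends entirely on how the tokenizer splits those characters.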
Interesting; I came across a post that mentioned using Kanji specifically to reduce context size.