Comment by vfalbor

7 hours ago

This is perfectly legitimate. It's something I've been calling out day after day. Company X charges you $10 per token while company Y charges $7, yet company X can end up cheaper because of the tokenizer it uses. Token consumption depends on the tokenizer, and companies build tokenizers with standard algorithms like BPE. But they're charging for hardware access, and the system can be biased to the point that an English prompt consumes 17% fewer tokens than the same prompt written in Spanish; writing in Chinese characters can reduce your token consumption even further compared to English. I've written about this several times on HN, but for whatever reason, every time I mention it, my post gets flagged.
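The arithmetic behind that claim is easy to sketch: what you pay is price-per-token times token count, and the token count depends on each provider's tokenizer. The numbers below are made-up illustrations, not real provider prices or real tokenizer outputs.

```python
# Minimal sketch: a lower per-token price doesn't guarantee a lower bill.
# All figures are illustrative assumptions, not real provider data.

def prompt_cost(price_per_token: float, token_count: int) -> float:
    """Cost of a single prompt at a given per-token price."""
    return price_per_token * token_count

# Suppose the same prompt tokenizes differently on each provider:
tokens_x = 120   # company X's tokenizer (assumed count)
tokens_y = 190   # company Y's tokenizer splits words more aggressively (assumed)

cost_x = prompt_cost(10.0, tokens_x)  # "$10 per token" in toy units
cost_y = prompt_cost(7.0, tokens_y)   # "$7 per token" in toy units

print(cost_x, cost_y)  # 1200.0 1330.0 -> X is cheaper despite the higher price
```

The same mechanism produces the language bias: if the tokenizer's vocabulary was trained mostly on English, a Spanish or Chinese prompt of equal meaning splits into more tokens and costs more.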

I have often wondered whether Chinese is a much 'better' language for LLMs: every character is a token and you're done. No weird subword splitting, no strange semantics attached to arbitrary chunks of words. I feel like there must be benefits to having the language tokenized in what must be very close to a 1:1 mapping.
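Whether that 1:1 mapping actually holds depends on the tokenizer. A byte-level BPE tokenizer starts from UTF-8 bytes, where a typical Chinese character is 3 bytes; a character only becomes a single token if the vocabulary has learned merges that cover it. A stdlib-only sketch of the underlying byte counts (the example strings are my own):

```python
# Sketch: why "one character = one token" depends on the tokenizer.
# Byte-level BPE sees UTF-8 bytes, and a CJK character is 3 bytes each,
# so without learned merges a Chinese character would cost MORE than
# one token, not exactly one.

def utf8_bytes(text: str) -> int:
    return len(text.encode("utf-8"))

english = "hello world"
chinese = "你好世界"  # "hello world", 4 characters

print(len(english), utf8_bytes(english))  # 11 characters, 11 bytes
print(len(chinese), utf8_bytes(chinese))  # 4 characters, 12 bytes
```

So the per-character information density is real, but the token count you're billed for still hinges on how well the vocabulary was trained on that script.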

  • I'm not sure what the state of the art is today, but 15 years ago I worked on a cross-lingual search engine. One challenge with Chinese was that n-gram-style models for detecting common language errors (such as typos) were simply ineffective for this reason.

    We found a lot of gain from ranking features based on Pinyin, which detect typos/misspellings caused by homophones (and similar-sounding words). I was investigating stroke decomposition to try to detect near-homographs as well, but couldn't find any good libraries at the time.

    I could imagine the homophone issue is especially relevant for spoken input to LLMs. LLMs are good enough that they're usually right, so it's probably less of an issue now, but in English I can make wild typos and everything still works. I'm curious how well that holds for Chinese, since I suspect it's a far harder problem due to the lack of subword tokens.

  • Yes, it is. In fact, I built a small application to reduce token consumption when translating from one language to another, and I even invented a language called Tokinensis, a mix of different languages; in my own tests it saved about 30%. Chinese is amazing because a single symbol encapsulates a ton of information, so you can save a ton of tokens.