Comment by speedgoose

6 years ago

I don't understand why it shows Chinese characters. Assuming UTF-8, English characters are a lot more compact than Chinese characters, so we can't really compare.

Otherwise it's a good idea and it works, but it's super slow, it only works for English text, and the system requirements are huge. I like it.

It's counting characters, so it is comparable.

This is useful for applications that limit the number of characters, e.g. Twitter.
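To make the two ways of counting in this exchange concrete, here is a minimal Python sketch. The helper name `sizes` is made up for illustration, and it assumes "characters" means Unicode code points.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Latin letters take 1 byte each in UTF-8; common CJK ideographs take 3.
print(sizes("thanks"))  # (6, 6)
print(sizes("謝謝"))     # (2, 6)
```

Measured in bytes the two strings are the same size, but under a per-character limit the second one uses a third of the budget.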

  • Yep, as far as I can tell, you can cram about twice as much information into the same number of Japanese characters as you would into Latin characters.

    I wonder if Chinese is even more info dense, as it does not have the syllabic hiragana/katakana characters?

    • Modern Chinese is typically more dense than modern Japanese (which is partially phonetic), and ancient formal Chinese is even more compact than modern Chinese.

      However it's worth noting that Chinese characters are analogous to entire words in English, and are composed of components much like English characters are composed of letters.

      For example "thanks" is spelled "t h a n k s"

      "謝" is made up of "言 身 寸"

      (Of course, the components in Chinese have less correlation to their pronunciation, but the main point I'm making here is that there is a LOT of overlap in the common components used to assemble the entire Chinese lexicon.)

      It is really not a fair comparison to compare languages in terms of their number of characters needed to represent something.

      Better measures would be the fastest time (in seconds) needed to use speech to convey a concept intelligibly to an average native speaker, or the square centimeters of paper needed to convey an idea given the same level of eyesight.

  • I remember seeing a post on HN a while ago where someone compressed an entire game of <idk^> into a tweet. They used Chinese characters for it.

    Maybe someone remembers better than me.

    ^ It was something like Go or Tetris where they were tracking every single move.

I agree, this is confusing. It also shows Korean and Hiragana mixed with Chinese. The significance of this is confusing to CJK speakers.

If you're counting by "number of characters" and are going to mix Chinese, Japanese, and Korean, which nobody would ever do anyway, you might as well use the entire Unicode range, including all the emoji.

Also, "number of characters" is a bit meaningless, in the sense that human-intelligible Chinese is already far more compact than human-intelligible English in number of characters. That's only because each character inherently carries more information, not because the language itself is a compressed representation of ideas. Chinese characters are also made up of a standard set of components that are reused throughout the lexicon and assembled in different ways to make different characters, so it isn't "fair" to count a Chinese character on the same footing as an English character. A Chinese character is loosely analogous to an entire word in English, and the components inside Chinese characters are kind of analogous to English letters (except that they are arranged two-dimensionally and are most of the time not related to a character's pronunciation).

A more interesting study would be compressing English text into shorter strings that use English characters only.

  • The raw output is actually closer to random bits; I'm certain he just mapped those bits to CJK characters so they would be printable. The output is not intended to be intelligible, as far as I can tell.
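A mapping like the one this reply describes could look like the sketch below. It is not the author's actual scheme, which the thread does not show. The choice of 14 bits per character, the CJK Unified Ideographs block starting at U+4E00, and the names `encode` and `decode` are all assumptions for illustration.

```python
BASE = 0x4E00  # first code point of the CJK Unified Ideographs block (assumed target range)
BITS = 14      # 2**14 = 16384 values, so output stays within U+4E00..U+8DFF


def encode(data: bytes) -> str:
    """Map arbitrary bytes to printable CJK characters, 14 bits per character."""
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % BITS)  # zero-pad to a multiple of 14 bits
    return "".join(
        chr(BASE + int(bits[i:i + BITS], 2)) for i in range(0, len(bits), BITS)
    )


def decode(text: str, n_bytes: int) -> bytes:
    """Invert encode(); n_bytes tells the decoder where the padding starts."""
    bits = "".join(f"{ord(c) - BASE:0{BITS}b}" for c in text)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, n_bytes * 8, 8))


payload = bytes([0x13, 0x37, 0xC0, 0xFF, 0xEE, 0x00, 0x42])
packed = encode(payload)
print(len(payload), len(packed))  # 7 4
```

Seven bytes (56 bits) fit into four characters here, which is why a per-character limit favors this kind of output even though the result is not readable text in any language.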