Comment by dheera
6 years ago
I agree, this is confusing. It also shows Korean and Hiragana mixed with Chinese. The significance of this is confusing to CJK speakers.
If you're counting by "number of characters" and are going to mix Chinese, Japanese, and Korean together, which nobody would ever do anyway, you might as well use the entire Unicode range, including all the emoji.
Also, "number of characters" is a bit meaningless in the sense that human-intelligible Chinese is already far more compact than human-intelligible English in number of characters, and that's only because each character inherently carries more information, not because the language itself is a compressed representation of ideas. Chinese characters are also made up of a standard set of components that are reused throughout the lexicon and assembled in different ways to make different characters, so it isn't "fair" to count a Chinese character on the same footing as an English character. A Chinese character is loosely analogous to an entire word in English, and the components inside Chinese characters are kind of analogous to English letters (except they are arranged two-dimensionally and are most of the time not related to a character's pronunciation).
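As a rough illustration of why raw character counts aren't comparable, here is a minimal Python sketch that compares character count against UTF-8 byte count. The sample strings and the `describe` helper are made up for illustration, and byte count is only a crude proxy for how much each character carries.

```python
def describe(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Two roughly equivalent greetings; chosen only as an example.
samples = {
    "English": "Hello, world",
    "Chinese": "你好，世界",
}

for name, text in samples.items():
    chars, utf8_bytes = describe(text)
    print(f"{name}: {chars} characters, {utf8_bytes} UTF-8 bytes")
```

The Chinese string has fewer than half as many characters, yet takes more bytes to store, so "shorter in characters" says little by itself.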
A more interesting study would be compressing English text into shorter strings that use English characters only.
The raw output is actually closer to random bits; I'm guessing he just mapped those bits to CJK characters so they would be printable. The output is not intended to be intelligible, as far as I can tell.
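To make that guess concrete, here is a minimal Python sketch of one way to map arbitrary bits onto printable CJK characters. This is not the author's actual scheme, which isn't shown; the 14-bits-per-character packing and the names `bytes_to_cjk` and `cjk_to_bytes` are assumptions for illustration only.

```python
CJK_BASE = 0x4E00    # first code point of the CJK Unified Ideographs block
BITS_PER_CHAR = 14   # 2**14 = 16384 values, which fits inside U+4E00..U+9FFF


def bytes_to_cjk(data: bytes) -> str:
    """Pack bytes into 14-bit groups and render each group as one CJK character."""
    bits = "".join(f"{b:08b}" for b in data)
    # Pad with zero bits so the length is a multiple of BITS_PER_CHAR.
    bits += "0" * (-len(bits) % BITS_PER_CHAR)
    return "".join(
        chr(CJK_BASE + int(bits[i:i + BITS_PER_CHAR], 2))
        for i in range(0, len(bits), BITS_PER_CHAR)
    )


def cjk_to_bytes(text: str, n_bytes: int) -> bytes:
    """Invert bytes_to_cjk; n_bytes says how many bytes to keep (drops the padding)."""
    bits = "".join(f"{ord(c) - CJK_BASE:0{BITS_PER_CHAR}b}" for c in text)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, n_bytes * 8, 8))


if __name__ == "__main__":
    import os

    payload = os.urandom(16)          # stand-in for compressed output
    encoded = bytes_to_cjk(payload)
    print(encoded)                    # printable, but not meaningful text
    assert cjk_to_bytes(encoded, len(payload)) == payload
```

The result looks like Chinese at a glance, but it is just a printable encoding of the underlying bits, so counting its characters mostly measures how many bits you chose to pack into each one.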