Comment by starpilot
6 years ago
I compressed "I am going to work outside today," then put the compressed output in Google Translate. Google translated the Chinese characters back to English as "raccoon."
I think the Chinese text that comes out confuses Google Translate. I took the whole first sentence of Hamlet's soliloquy, which compressed to 䮛趁䌆뺜㞵蹧泔됛姞音逎贊, and plugged that into Google Translate. It came back with "Commendation." The reverse translation is 表彰.
It's not Chinese text, it's an arithmetic-coded stream of bits mapped so the bits fall within the range of some codepoints. It's basically a variant of base64 except for Unicode.
(Side note: aren't these codepoints very expensive to encode in UTF-8? It seems there must be a lower-valued range more suited to it)
The page for base32768 has some efficiency charts for different binary to text encodings on top of different UTF encodings, as well as how many bytes you can use them to stuff in a tweet. Depends on where you're going to house the data, I guess.
https://github.com/qntm/base32768
Yeah I don't understand why it's using CJK, the page claims:
> each compressed character holds 15 data bits by using the CJK and the Hangul Syllables unicode ranges.
In UTF-8 these characters take 3 bytes each, which makes the scheme less space-efficient than base64 (60% overhead vs 33% overhead).
The CJK/Hangul scheme has more information per character but I'm not sure where that matters.
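For anyone who wants to check those numbers, here is a rough sketch of the arithmetic. It assumes the page's claim of 15 data bits per character is accurate and that the string quoted earlier in the thread is representative of the output; the `overhead` helper is just a name made up for this comment.

```python
def overhead(data_bits_per_char: int, encoded_bytes_per_char: int) -> float:
    """Fractional size overhead: encoded bits per data bit, minus 1."""
    return encoded_bytes_per_char * 8 / data_bits_per_char - 1

# Compressed output quoted earlier in the thread (assumed representative).
sample = "䮛趁䌆뺜㞵蹧泔됛姞音逎贊"

# Distinct per-character sizes under each encoding.
utf8_sizes = {len(ch.encode("utf-8")) for ch in sample}
utf16_sizes = {len(ch.encode("utf-16-le")) for ch in sample}

print("UTF-8 bytes per char: ", utf8_sizes)
print("UTF-16 bytes per char:", utf16_sizes)

# 15 data bits per character is the page's claim, not something measured here.
print(f"15-bit scheme in UTF-8:  {overhead(15, 3):.0%}")
print(f"15-bit scheme in UTF-16: {overhead(15, 2):.0%}")
# base64 carries 6 data bits per ASCII character.
print(f"base64 in UTF-8:         {overhead(6, 1):.0%}")
print(f"base64 in UTF-16:        {overhead(6, 2):.0%}")
```

If the storage is UTF-16 (JavaScript strings, for example), every one of these characters costs 2 bytes and the comparison flips in favor of the 15-bit scheme, so that may be where the extra information per character matters.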
It's probably similar to this: https://pieroxy.net/blog/pages/lz-string/index.html
Check the 'How does it look?' section.
These are Chinese characters mixed with Korean characters, which is pretty much never done by humans. It is analogous to mixing English and hieroglyphics and typing out some gibberish with both.
The author might as well have included the rest of the Unicode range including Arabic, Emoji, and math symbols.
I hit the reverse button again and got "recognition", which translated back to 承认. That finally settled into a closed loop: "recognition" and back to the same Chinese text.
For more fun, enter "I am going to work outside today", compress, delete the second character, and decompress. The result is... "I know what you're thinking –"