Comment by greenshackle2

6 years ago

Yeah I don't understand why it's using CJK, the page claims:

> each compressed character holds 15 data bits by using the CJK and the Hangul Syllables unicode ranges.

In UTF-8 these characters take 3 bytes each, which makes the scheme less space-efficient than base64: 24 encoded bits per 15 data bits is 60% overhead, versus 33% for base64.

The CJK/Hangul scheme has more information per character but I'm not sure where that matters.
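For anyone who wants to check the overhead figures, here is a minimal sketch of the arithmetic. It assumes the output is stored as UTF-8, that every CJK/Hangul character used carries exactly 15 data bits (as the page claims), and it ignores base64 padding. The helper name `utf8_overhead` is made up for illustration.

```python
def utf8_overhead(data_bits_per_char: int, utf8_bytes_per_char: int) -> float:
    """Fractional size overhead of a text encoding relative to the raw payload."""
    encoded_bits = 8 * utf8_bytes_per_char
    return encoded_bits / data_bits_per_char - 1


# Sample characters from the CJK and Hangul Syllables ranges take 3 bytes in UTF-8.
cjk_bytes = len("中".encode("utf-8"))
hangul_bytes = len("한".encode("utf-8"))

# base64: 6 data bits per 1-byte ASCII character.
base64_overhead = utf8_overhead(6, 1)
# CJK/Hangul scheme: 15 data bits per 3-byte character (assumption from the quoted page).
cjk_overhead = utf8_overhead(15, cjk_bytes)

print(f"base64:     {base64_overhead:.0%} overhead")
print(f"CJK/Hangul: {cjk_overhead:.0%} overhead")
```

So measured in bytes base64 wins, while measured in data bits per character (15 vs 6) the CJK/Hangul scheme wins.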

It's so that the output uses printable characters, that's all. The raw output would actually just be random bits, or at least something approximating random bits if GPT-2 is as good as we hope it is.

  • Yes, I understand. But base64 is the bog-standard solution for encoding arbitrary binary data as printable characters.

    So I'm just wondering why you would use something more obscure, less space-efficient, and not ASCII-compatible.

    • Because it will impress uneducated people with how much smaller (in terms of screen real estate) the resulting message is.

      EDIT: Ohhh, I know, and because Twitter counts characters, not bytes. So you can use this to put essays into tweets.
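To put a rough number on the character-limit argument, here is a small sketch. It assumes a hypothetical limit of 280 characters where every character counts as one, which may not match how a real platform weights CJK or Hangul characters, so treat the CJK figure as an upper bound. `payload_bytes` and `CHAR_LIMIT` are illustrative names, not anything from the project.

```python
def payload_bytes(char_limit: int, data_bits_per_char: int) -> int:
    """Whole bytes of raw data that fit in char_limit encoded characters."""
    return (char_limit * data_bits_per_char) // 8


# Hypothetical limit; assumes each character counts once regardless of script.
CHAR_LIMIT = 280

base64_capacity = payload_bytes(CHAR_LIMIT, 6)   # 6 data bits per character
cjk_capacity = payload_bytes(CHAR_LIMIT, 15)     # 15 data bits per character (page's claim)

print(f"base64:     {base64_capacity} bytes in {CHAR_LIMIT} characters")
print(f"CJK/Hangul: {cjk_capacity} bytes in {CHAR_LIMIT} characters")
```

Under those assumptions the CJK/Hangul scheme carries 2.5 times as much compressed data per message as base64, even though it costs more bytes on disk.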