Comment by james412

6 years ago

It's not Chinese text, it's an arithmetic-coded stream of bits mapped so the bits fall within the range of some codepoints. It's basically a variant of base64 except for Unicode.

(Side note: aren't these codepoints very expensive to encode in UTF-8? It seems there must be a lower-valued range more suited to it)
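For illustration, here's a toy Python sketch of the "base64 but for Unicode" idea: cut the bit stream into 15-bit chunks and map each chunk to one code point. The offset `BASE = 0x4E00` and the names `encode`/`decode` are assumptions made up for this example, not the alphabet or API of any real encoder. It only shows that every resulting character lands in the range that costs 3 bytes in UTF-8.

```python
BASE = 0x4E00  # hypothetical offset; real encoders choose their own code point ranges
BITS = 15      # data bits carried by each output character

def encode(data: bytes) -> str:
    # View the input as one long bit string, zero-padded to a multiple of 15 bits.
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % BITS)
    # Each 15-bit chunk becomes one code point in BASE .. BASE + 0x7FFF.
    return "".join(chr(BASE + int(bits[i:i + BITS], 2))
                   for i in range(0, len(bits), BITS))

def decode(text: str, n_bytes: int) -> bytes:
    # The caller supplies the original length so the padding bits can be dropped.
    bits = "".join(f"{ord(c) - BASE:015b}" for c in text)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, n_bytes * 8, 8))

data = bytes(range(30))                # 240 bits of sample input
text = encode(data)
utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))
```

With 30 input bytes this gives 16 characters, which take 48 bytes in UTF-8 but only 32 bytes in UTF-16.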

7 comments

toast0 6 years ago

The page for base32768 has some efficiency charts for different binary-to-text encodings on top of different UTF encodings, as well as how many bytes each one lets you stuff into a tweet. Depends on where you're going to house the data, I guess.

https://github.com/qntm/base32768

infogulch 6 years ago

In addition to being 94% efficient in UTF-16 (!), this reveals another reason why one might want to optimize for the number of characters: fitting as many bytes as possible into a tweet, which is limited by the number of characters, not bytes.

greenshackle2 6 years ago

Yeah, I don't understand why it's using CJK. The page claims:

> each compressed character holds 15 data bits by using the CJK and the Hangul Syllables unicode ranges.

In UTF-8 these characters take 3 bytes each, which makes it less space-efficient than base64 (60% overhead vs. 33% overhead).

The CJK/Hangul scheme has more information per character but I'm not sure where that matters.
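The arithmetic behind those overhead figures, as a quick Python sketch. The function name `overhead` and the variable names are just for this example; the only inputs are the data bits per character (6 for base64, 15 for the CJK/Hangul scheme) and the encoded bytes per character.

```python
def overhead(data_bits_per_char: int, encoded_bytes_per_char: int) -> float:
    """Extra size relative to the raw data, as a fraction (0.6 means 60% larger)."""
    return encoded_bytes_per_char * 8 / data_bits_per_char - 1

base64_utf8 = overhead(6, 1)    # 8 / 6 - 1   -> about 33%
cjk15_utf8 = overhead(15, 3)    # 24 / 15 - 1 -> 60%
base64_utf16 = overhead(6, 2)   # 16 / 6 - 1  -> about 167%
cjk15_utf16 = overhead(15, 2)   # 16 / 15 - 1 -> about 6.7%

# Efficiency is the inverse view: data bits divided by encoded bits.
cjk15_utf16_efficiency = 15 / 16  # 0.9375, i.e. roughly 94%
```

So the 15-bit scheme loses to base64 in UTF-8 but wins clearly in UTF-16, or anywhere the limit is counted in characters rather than bytes.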

jkhdigital 6 years ago

It's so that the output uses printable characters, that's all. The raw output would actually just be random bits, or at least something approximating random bits if GPT-2 is as good as we hope it is.
- greenshackle2 6 years ago
  
  Yes, I understand. But base64 is the bog-standard solution for encoding arbitrary binary data as printable characters.
  So I'm just wondering why you would use something more obscure, less space-efficient, and not ASCII-compatible.
  

willcipriano 6 years ago

It's probably similar to this: https://pieroxy.net/blog/pages/lz-string/index.html

Check the 'How does it look?' section.