Comment by sparkie

2 hours ago

VLQ/LEB128 are a bit better than EBML's variable-size integers. You test the MSB of each byte: `0` means it's the end of a sequence and the next byte starts a new sequence. If the MSB is `1`, you find the start of the sequence by walking back until you hit the zero MSB that ends the previous sequence (or the start of the stream). There are efficient SIMD-optimized implementations of this.

The difference between VLQ and LEB128 is endianness, basically whether the zero MSB is the start or end of a sequence.

    0xxxxxxx                   - ASCII
    1xxxxxxx 0xxxxxxx          - U+0080 .. U+3FFF
    1xxxxxxx 1xxxxxxx 0xxxxxxx - U+4000 .. U+10FFFD

                      0xxxxxxx - ASCII
             0xxxxxxx 1xxxxxxx - U+0080 .. U+3FFF
    0xxxxxxx 1xxxxxxx 1xxxxxxx - U+4000 .. U+10FFFD
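To make the per-byte continuation concrete, here's a minimal unsigned-LEB128 decoder sketch in C (the function name is mine, not from any particular library). Note the data-dependent branch on every byte, which is the performance problem mentioned below:

```c
#include <stdint.h>
#include <stddef.h>

// Sketch of an unsigned LEB128 decoder. Each byte contributes 7 payload
// bits (little-endian group order); an MSB of 1 means another byte
// follows, an MSB of 0 marks the end of the sequence.
static size_t leb128_decode(const uint8_t *buf, uint64_t *out) {
    uint64_t value = 0;
    size_t i = 0;
    do {
        value |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
    } while (buf[i++] & 0x80);      // one unpredictable branch per byte
    *out = value;
    return i;                       // number of bytes consumed
}
```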

It's not self-synchronizing like UTF-8, but it's more compact - any Unicode codepoint fits into 3 bytes (which can encode up to 0x1FFFFF), and ASCII characters remain 1 byte. It can grow to arbitrary sizes. It has a fixed overhead of 1/8, whereas UTF-8 has an overhead of 1/8 for ASCII and roughly 1/3 thereafter. It could be useful for shrinking code that uses non-ASCII, since most of the mathematical symbols/arrows are ≤ U+3FFF. Also languages like Japanese, since Katakana and Hiragana are also ≤ U+3FFF and could be encoded in 2 bytes rather than 3.

Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

  | Header     | Total Bytes | Payload Bits |
  | ---------- | ----------- | ------------ |
  | `.......1` |      1      |       7      |
  | `......10` |      2      |      14      |
  | `.....100` |      3      |      21      |
  | `....1000` |      4      |      28      |
  | `...10000` |      5      |      35      |
  | `..100000` |      6      |      42      |
  | `.1000000` |      7      |      49      |
  | `10000000` |      8      |      56      |
  | `00000000` |      9      |      64      |

The full value is stored little-endian, so you simply read the first byte (the low byte) in the stream to get the total length, and it has the exact same compactness as VLQ/LEB128 (7 bits per byte).
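As a sketch of the encoding side of this scheme (my own illustration, not the BONJSON reference code; it assumes a little-endian host and values that fit in 56 bits, so at most 8 total bytes):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Encode `value` as a right-to-left unary length field: find the smallest
// n (1..8) with value < 2^(7n), then store (value << n) with the unary
// header (a 1 bit at position n-1, zeros below it) in the low byte.
static size_t encode_length_field(uint64_t value, uint8_t buf[8]) {
    size_t n = 1;
    while (n < 8 && (value >> (7 * n)) != 0) n++;
    uint64_t word = (value << n) | (1ULL << (n - 1));
    memcpy(buf, &word, 8);          // little-endian host assumed
    return n;                       // total bytes actually used
}
```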

Even better: modern chips have instructions that decode this field in one shot (callable via builtin):

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

    static inline size_t decodeLengthFieldTotalByteCount(uint8_t header) {
        // Note: a zero header (the 9-byte case) needs separate handling,
        // since __builtin_ctz(0) is undefined.
        return (size_t)__builtin_ctz(header) + 1;
    }

After running this builtin, you simply re-read the memory location for the specified number of bytes, cast it to a little-endian integer, and shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to 56-bit numbers max, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).
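A hypothetical branchless decode along those lines might look like this (again my own sketch, not the library's code; it assumes a GCC/Clang builtin, a little-endian host, at least 8 readable bytes at `buf`, and values limited to 56 bits so the `00000000` header never occurs):

```c
#include <stdint.h>
#include <string.h>

// Branchless decode of a right-to-left unary length field (max 56-bit
// payload): count trailing zeros to get the byte count, read 8 bytes as a
// little-endian word, mask off bytes beyond the field, shift out the header.
static inline uint64_t decode_length_field(const uint8_t *buf) {
    unsigned total_bytes = (unsigned)__builtin_ctz(buf[0]) + 1; // 1..8
    uint64_t word;
    memcpy(&word, buf, 8);                  // little-endian host assumed
    word &= ~0ULL >> (64 - 8 * total_bytes); // keep only this field's bytes
    return word >> total_bytes;              // drop the unary header bits
}
```

The mask uses a shift of `64 - 8 * total_bytes` (0..56), so no branch is needed even for the 8-byte case.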

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.

https://github.com/kstenerud/bonjson

If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.