
Comment by kstenerud

Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

  | Header     | Total Bytes | Payload Bits |
  | ---------- | ----------- | ------------ |
  | `.......1` |      1      |       7      |
  | `......10` |      2      |      14      |
  | `.....100` |      3      |      21      |
  | `....1000` |      4      |      28      |
  | `...10000` |      5      |      35      |
  | `..100000` |      6      |      42      |
  | `.1000000` |      7      |      49      |
  | `10000000` |      8      |      56      |
  | `00000000` |      9      |      64      |

The full value is stored little-endian, so you simply read the first byte (the low byte) in the stream to get the total length of the field, and it has exactly the same compactness as VLQ/LEB128 (7 payload bits per byte).
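
A minimal sketch of how a value would be packed into this format (little-endian host assumed, payloads limited to 56 bits; the helper name is mine, not from the spec):

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    // Packs `value` (< 2^56) as: payload shifted left past the unary header,
    // with a single 1 bit whose position encodes the total byte count.
    // Returns that count; bytes 0..count-1 of `out` are the field, low byte first.
    static inline size_t encodeLengthField56(uint64_t value, uint8_t out[8]) {
        size_t count = 1;
        while ((value >> (7 * count)) != 0) {   // one byte per 7 payload bits
            count++;
        }
        uint64_t encoded = (value << count) | (1ull << (count - 1));
        memcpy(out, &encoded, sizeof(encoded)); // little-endian write
        return count;
    }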

Even better: modern chips have instructions that decode this field in one shot (callable via builtin):

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

    static inline size_t decodeLengthFieldTotalByteCount(uint8_t header) {
        // Trailing-zero count gives the unary header length; the 9-byte
        // `00000000` header must be special-cased, since ctz(0) is undefined.
        return (size_t)__builtin_ctz(header) + 1;
    }

After running this builtin, you simply re-read the memory location for the specified number of bytes, then cast to a little-endian integer, then shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56 bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).
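
A rough sketch of that decode path for the branchless case (little-endian host, at least 8 readable bytes at `buf`, fields capped at 8 total bytes so the `00000000` header never appears; the function name is mine, not the library's):

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    static inline uint64_t decodePayloadUpTo56Bits(const uint8_t* buf, size_t* totalBytes) {
        size_t count = (size_t)__builtin_ctz(buf[0]) + 1; // total bytes, 1..8
        uint64_t word;
        memcpy(&word, buf, sizeof(word));                 // re-read, cast to little-endian integer
        word &= ~0ull >> (64 - 8 * count);                // keep only the field's bytes
        *totalBytes = count;
        return word >> count;                             // shift out the unary header bits
    }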

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.

https://github.com/kstenerud/bonjson

If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.

We can do better than one branch per byte - we can get it down to one per 8 bytes at least.

We'd use `vpmovb2m`[1] on a ZMM register (64 bytes at a time), which fills a 64-bit mask register with the MSB of each byte in the vector.

Then process the mask register one byte at a time, using each byte as an index into a 256-entry jump table. Each entry would be specialized to process the next 8 bytes without branching, and finish with a conditional branch to either the next entry in the jump table or the next 64 bytes. Any trailing ones in a mask byte would simply be added to a carry, which would be consumed up to the most significant zero in the next eight-byte group.

[1]:https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
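
For what it's worth, a minimal sketch of just the mask-extraction step described above (requires AVX-512BW; the helper name is made up, and the jump-table dispatch on each mask byte is left out):

    #include <immintrin.h>
    #include <stdint.h>

    // One 64-bit mask per 64-byte block: each bit is the MSB (the VLQ/LEB128
    // continuation bit) of the corresponding byte, so a 0 bit marks the last
    // byte of an encoded value.
    static inline uint64_t continuationMask64(const uint8_t* block) {
        __m512i bytes = _mm512_loadu_si512((const void*)block);
        return _mm512_movepi8_mask(bytes);  // VPMOVB2M
    }

Each byte of this mask would then be the index into the 256-entry jump table described above.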

  • Sure, but with the above algorithm you could do it in zero branches, and in parallel if you like.

    • Decoding into integers may be faster, but that's kind of missing the point of why I suggested VLQs as opposed to EBML's variable-length integers - they're not a good fit for string handling. In particular, if we wanted to search for a character or substring we'd have to start from the beginning of the stream and traverse it linearly, because there's no synchronization - the payload bytes are indistinguishable from header bytes, which makes a parallel search impractical.

      While you might be able to come up with some heuristic to determine whether a character is a valid match, it may give false positives, and it's unlikely to be as efficient as "test if the previous byte's MSB is zero" (sketched below). We can implement parallel search with VLQs because we can trivially resynchronize the stream to the nearest character boundary in either direction - it's partially synchronizing.

      Obviously not as good as UTF-8 or UTF-16 which are self-synchronizing, but it can be implemented efficiently and cut encoding size.
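
      A tiny sketch of that resynchronization check, assuming MSB-as-continuation-bit VLQs (the helper is hypothetical, not from any of the linked code):

          #include <stdint.h>
          #include <stddef.h>

          // A value starts at index i iff i == 0 or the previous byte's MSB
          // (the continuation bit) is clear, so landing mid-value is cheap to fix.
          static inline size_t vlqSyncForward(const uint8_t* buf, size_t len, size_t pos) {
              while (pos > 0 && pos < len && (buf[pos - 1] & 0x80) != 0) {
                  pos++;  // previous byte continues a value; step forward
              }
              return pos; // now at the start of a value (or at `len`)
          }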