
Comment by sparkie

13 hours ago

We can do better than one branch per byte - we can get down to one branch per 8 bytes at least.

We'd use `vpmovb2m`[1] on a ZMM register (64 bytes at a time), which fills a 64-bit mask register with the MSB of each byte in the vector.

Then process the mask register one byte at a time, using each byte as an index into a 256-entry jump table. Each entry would be specialized to process the next 8 bytes without branching, finishing with a conditional branch either to the next entry in the jump table or on to the next 64 bytes. Any trailing ones in a mask byte would simply be added to a carry, which is consumed up to the most significant zero in the next eight bytes.

[1]:https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

Sure, but with the above algorithm you could do it in zero branches, and in parallel if you like.

  • Decoding into integers may be faster, but it's kind of missing the point of why I suggested VLQs as opposed to EBML's variable length integers - the latter are not a good fit for string handling. In particular, if we wanted to search for a character or substring we'd have to start from the beginning of the stream and traverse linearly, because there's no synchronization - the payload bytes are indistinguishable from header bytes, making a parallel search impractical.

    While you might be able to use some heuristic to determine whether a character is a valid match, it may give false positives and it's unlikely to be as efficient as "test if the previous byte's MSB is zero". We can implement parallel search with VLQs because we can trivially synchronize the stream to the nearest character boundary in either direction - it's partially synchronizing.

    Obviously not as good as UTF-8 or UTF-16 which are self-synchronizing, but it can be implemented efficiently and cut encoding size.