← Back to context

Comment by i2talics

1 day ago

Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.

The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!

The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.

I've often done same thing with encoding msgpack maps while streaming in key/value pairs

  • Neat! It's a useful technique whenever you don't know or want to defer knowing the size of an integer until a later time, but need to allocate space for it up front.

    I'm wary of introducing these forced-canonical encodings by someone hyper focused on "efficiency" and "security" without reconsidering additional use cases.