
Comment by silisili

1 year ago

This is interesting, thanks.

> 0b11111110 - All 1s with a trailing 0, indicates heap allocated

> 0b11XXXXXX - Two leading 1s, indicates inline, with the trailing 6 bits used to store the length

I stared at this for too long, as it seems to allow a collision. Then I realized you'd never set the third bit, so it should probably have been written 0b110XXXXX, with a note that 5 bits are used for the length. Is that right, or did I misunderstand?

That description probably isn't helpful anyway. What's actually going on is more complicated; it is explained later at a high level, or I'll try now:

Rust has "niches" - bit patterns which are never used by that type and thus can be occupied by something else in a sum type (Rust's enum) which adds to that type. But stable Rust doesn't allow third parties to promise arbitrary niches exist for a type they made.

However, if you make a simple enumeration of N possibilities, it automatically has a niche: all the M-N bit patterns that weren't needed by your enumeration in the M-value machine integer chosen to store it (M will typically be 256 or 65536, depending on how many things you enumerated).
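A minimal sketch of that automatic niche, using only the standard library. The `Direction` enum is a hypothetical example, and the sizes reflect what current rustc does in practice for a field-less enum rather than a documented layout guarantee.

```rust
use std::mem::size_of;

// Hypothetical field-less enum: 3 variants stored in one byte (M = 256, N = 3),
// so 253 bit patterns are never valid and form the niche.
#[allow(dead_code)]
enum Direction {
    North,
    East,
    South,
}

// Returns the sizes of Direction, Option<Direction>, and Option<u8>.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<Direction>(),
        // Option can store None in one of the unused bit patterns.
        size_of::<Option<Direction>>(),
        // u8 uses all 256 patterns, so Option<u8> needs a separate tag byte.
        size_of::<Option<u8>>(),
    )
}

fn main() {
    let (plain, wrapped, wrapped_u8) = sizes();
    println!("Direction: {plain}, Option<Direction>: {wrapped}, Option<u8>: {wrapped_u8}");
}
```

The contrast with `Option<u8>` is the point: the wrapper only comes for free when the inner type leaves bit patterns unused.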

So, CompactString has a custom enum type LastUtf8Char which it uses for the last byte in its data structure. This has values V0 through V191, corresponding to the 192 possible last bytes of a UTF-8 string. That leaves 64 bit patterns unused. Then L0 through L23 represent lengths: inline strings of length 0 to 23 inclusive, which didn't need this last byte (if the length is 24, the last byte is string data, so it is one of V0 through V191). Now we've got 40 bit patterns left.

Then one bit pattern (the pattern equivalent to the unsigned integer 216) signifies that this string data lives on the heap and the rest should be interpreted accordingly, and another (217) signifies that it's a weird static allocation (I do not know why you'd do this).

That leaves 38 bit patterns unused once the type has taken all the ones it wanted, so there's still a niche for Option<CompactString> or MyCustomType<CompactString>.
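The partition of the last byte described above can be sketched as a plain classifier. This is an illustration, not compact_str's actual code: `LastByteKind` and `classify` are hypothetical names, and the 192..=215 range for lengths is inferred from the counts given above (192 character values, then 24 lengths, then 216 and 217).

```rust
// Hypothetical classification of the final byte of the 24-byte buffer.
#[derive(Debug, PartialEq)]
enum LastByteKind {
    /// 0..=191: the byte is string data, so the inline string is 24 bytes long.
    Utf8Char(u8),
    /// 192..=215: inline string whose length (0..=23) is stored here.
    InlineLen(u8),
    /// 216: the string data lives on the heap.
    Heap,
    /// 217: the string data is a static allocation.
    Static,
    /// 218..=255: never produced, so available as a niche.
    Unused,
}

fn classify(last: u8) -> LastByteKind {
    match last {
        0..=191 => LastByteKind::Utf8Char(last),
        192..=215 => LastByteKind::InlineLen(last - 192),
        216 => LastByteKind::Heap,
        217 => LastByteKind::Static,
        218..=255 => LastByteKind::Unused,
    }
}

fn main() {
    // Count the bit patterns left over for Option<..> and friends.
    let unused = (0..=255u8)
        .filter(|b| classify(*b) == LastByteKind::Unused)
        .count();
    println!("unused bit patterns: {unused}");
    println!("{:?}", classify(0b1111_1110));
}
```

Under this layout there is no overlap between the ranges, so the collision worry from the bit-mask description does not arise.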

  • Author of compact_str here, you hit the nail on the head, great explanation!

    > ... and another (217) signifies that it's a weird static allocation (I do not know why you'd do this)

    In addition to String, Rust also has str[1], which is an entirely different type. It's common to represent string literals known at compile time as `&'static str`, but they can't be used in all of the same places that a String can. For example, you can't put a &'static str into a Vec<String> unless you first heap allocate and create a String. We added the additional variant of 217 so users of CompactString could abstract over both string literals and strings created at runtime, which solves cases like that example.

    [1]: https://doc.rust-lang.org/std/primitive.str.html
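The Vec<String> case from the reply above can be shown with the standard library alone. The second function uses `Cow<'static, str>` purely as an analogy for one type that holds either a literal or a runtime string; it is not how CompactString implements its static variant, and the function names are hypothetical.

```rust
use std::borrow::Cow;

// A Vec<String> cannot hold a &'static str directly: the literal has to be
// copied into a fresh heap allocation first.
fn owned_only(runtime: String) -> Vec<String> {
    let literal: &'static str = "hello";
    vec![literal.to_string(), runtime]
}

// Analogy only: Cow<'static, str> can carry either a borrowed literal or an
// owned String, so the literal needs no allocation.
fn mixed(runtime: String) -> Vec<Cow<'static, str>> {
    vec![Cow::Borrowed("hello"), Cow::Owned(runtime)]
}

fn main() {
    let a = owned_only(String::from("world"));
    let b = mixed(String::from("world"));
    println!("{a:?} {b:?}");
}
```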

    • Thanks! And the explanation of 217 makes sense too.

      Since I have you here, wouldn't it be better to name that type "LastByte" or something? It's not a (Rust) char, and it's not necessarily UTF-8 whereas it is definitely the last byte.
