← Back to context

Comment by parkmycar

1 year ago

Hey I'm the author of compact_str, thanks for the kind words!

Fun fact, it was this fasterthanlime post that originally inspired me to play around with small strings and led to the creation of compact_str.

Do you think it is possible to integrate this with an existing string interner in the Rust ecosystem? Or one would need to roll their own?

The goal would be to compare two strings using only the 24-byte values, without touching the allocated part (if it points into a string larger than 24 bytes)

The first thing that makes me think this is impossible is that you only define an owned variant for your string type. But your can cleverly wrap a &'static str rather than allocate on heap, so if the interner can give a &'static str I think this could work, with some boilerplate. Namely: for strings smaller than 24 bytes don't intern and build a CompactString inline, for larger strings, intern, get a &'static str, then build a CompactString out of it. Then, two check if two strings are equal, you just compare its 24 bytes, and don't need to touch the interner for this at all.

However this only works if the interner actually leaks memory. If it returns you a &'a str that isn't 'static, then you can't store it on CompactString, unless your lib also provided a borrowed variant.

Also, to think about it, since interned strings are immutable (they are not a "string builder" like String is), you don't really need the capacity value for them, just the length. So it suggests a 16 bytes size, not 24. (but one could imagine an application where it's optimal to store 24 bytes inline rather than 16 anyway)

I think that this could be achieved with an 16 bytes &str wrapper, maybe something like &CompactStr, that works just like your CompactString, but wraps a &str instead (and offers no owned variant). Maybe such a thing could conceivably be included in the compact_string crate?

(Maybe make it 24 bytes even if it "wastes" some bytes, just to make it bitwise compatible with CompactString, so that borrowing CompactString into &CompactStr is zero cost - and then just zero out the remaining 8 bytes when a &CompactStr is stored on the heap)

[0] I was reading this post https://dev.to/cad97/string-interners-in-rust-797 that was written in response to this fasterthanlime post, but it contrasts interner with small string optimization, when you actually could have both!

  • I think the byteyarn library mentioned elsewhere in this thread meets your requirements:

    https://mcyoung.xyz/2023/08/09/yarns/

    • This is perfect, thanks!

      ... well except it stores up to 15 bytes inline rather than using the UTF-8 trick to cram an extra byte. (But maybe it can't; or at least, ByteYarn really can only store 15 bytes, even though Yarn could store 16 bytes maybe, but this would pose problems when converting between those types because I suppose this conversion is zero cost)

I've been wondering something for ages. Where did you get the 24 byte number, and how does it compare in Unicode terms? That is, did you analyze a large corpus and determine that 24 bytes was right for the largest number of strings? And does it come out to, say, 10 Unicode characters? Whenever I think about designing a new language, this very issue pops up.

  • To add some more detail to sibling's answer:

    The optimal size will depend on the application. It's certainly reasonable that in some applications, many/most strings would be under 24 bytes and thus the small string optimization in many of these implementations would be beneficial. Perhaps in some other application strings are closer to 32 bytes (or something else) and then a larger stack size would be warranted. And in yet other applications, strings are large and no small string optimizations will make any difference, and if anything will slow the application down with unnecessary bookkeeping.

    I do find it surprising that none of the implementations in the various comments linked in this thread seem to provide user-tunable sizes; or at least I haven't seen it. Because I can certainly imagine cases where the optimal size is > 24.

  • This style of small string optimization tries to take up the same amount of space on the stack as a "normal" heap-allocated string. On 64-bit platforms that is 24 bytes: 8 bytes for the pointer to the heap allocation, 8 bytes for the number of characters/bytes in the string, and 8 bytes for the allocation capacity.

    It's quite possible to make the small string buffer larger, but that comes at the cost of the large string representation taking up more space than necessary on the stack. IIRC libstdc++ does this, which makes its std::string take up 32 bytes on the stack.

    • > It's quite possible to make the small string buffer larger, but that comes at the cost of the large string representation taking up more space than necessary on the stack. IIRC libstdc++ does this, which makes its std::string take up 32 bytes on the stack.

      Though to follow through on that, 24 bytes is also more than necessary. You don't have to be very clever to shrink your string size to 16 bytes (6 for the pointer, 6 for the size, 4 to store either capacity or spare capacity as floating point).

      1 reply →

  • Not sure why the downvotes? This is a sincere question.

    • For a lot of people it will seem obvious that String (Rust's standard library growable string type) is 24 bytes. CompactString is 24 bytes so that it is exactly the same size as String, that's the main idea. That's why they may not have believed you were sincere in asking.

      Why is Rust's String 24 bytes? Well, this is a growable string type. So we need to store 1) where on the heap our growable buffer is and 2) how big that heap space is ("Capacity"), and 3) how much of it we're using already to store the string ("Size" or "Length"). On modern 64-bit computers it's reasonable to use a 64-bit (8 byte) value for all three of these facts, 3 times 8 = 24.

      In fact Rust (unlike C++) insists on doing this the simplest and fastest way possible so the String type is more or less literally Vec<u8> - a growable array of bytes, plus a guarantee that those bytes are UTF-8 encoded text. Vec<u8> is likewise 24-bytes.

      The rationale is that being simple is a good fundamental design, and (as CompactString illustrates) people can build more sophisticated types if they want to unlock specific optimisations which may suit their application.

      1 reply →

Is 26 characters next? I think the number of bytestrings of length <= 26 that are also valid UTF-8 is only 0.021 * 256^24.

  • I think the problem you encounter with this is that you can no longer reference the small string as a &str.