Comment by mananaysiempre
1 year ago
More of a knock against C than C++, seeing as today’s C++ tends to prefer `string_view`s or at worst iterator pairs, neither of which use null-termination. (Saying that as someone who prefers C to C++ and avoids Rust exactly because of “better C++” vibes—I don’t want a better version of C++, if anything I want a much much simpler one that’s worse at some things.)
That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload (so 23 useful ones, the best you could hope for with null termination) into the same structure the same way by storing the compact version with null termination and ensuring the last byte is also always zero. For extra style points, define the last byte to be 24 minus payload length instead so you don’t need to recompute the length in the inline case.
To be clear: none of this makes null termination not dumb.
> That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload (so 23 useful ones, the best you could hope for with null termination) into the same structure the same way by storing the compact version with null termination and ensuring the last byte is also always zero.
I want to say libc++ and maybe MSVC do something along those lines in their std::string implementations.
> For extra style points, define the last byte to be 24 minus payload length instead so you don’t need to recompute the length in the inline case.
IIRC Facebook's FBString from Folly does (did?) that?
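For anyone following along, here is a minimal sketch of the last-byte trick quoted above. It is not FBString's actual code; the `InlineString` type and its members are made up for illustration. It stores 23 minus the number of useful bytes in the last byte (the same thing as 24 minus the payload length if you count the terminator), so at full capacity that byte is zero and doubles as the NUL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical inline-only string, 24 bytes total (illustrative names).
// bytes[0..22] hold the text; bytes[23] holds the remaining capacity.
struct InlineString {
    static constexpr std::size_t kCapacity = 23;  // useful bytes
    char bytes[24];

    // Copies len bytes from s and NUL-terminates. len must be <= 23.
    void assign(const char* s, std::size_t len) {
        assert(len <= kCapacity);
        std::memcpy(bytes, s, len);
        bytes[len] = '\0';  // when len == 23 this lands on bytes[23]
        // Remaining capacity; 0 when full, so it is also the terminator.
        bytes[23] = static_cast<char>(kCapacity - len);
    }

    // O(1) length: no strlen needed in the inline case.
    std::size_t size() const {
        return kCapacity - static_cast<unsigned char>(bytes[23]);
    }

    const char* c_str() const { return bytes; }
};

static_assert(sizeof(InlineString) == 24, "expected a 24-byte layout");
```

After `assign("hello", 5)` the last byte holds 18 and `size()` returns 5 without scanning; after assigning a 23-byte string the last byte is 0 and serves as the terminator.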
> I want to say libc++ and maybe MSVC do something along those lines in their std::string implementations.
Here's Raymond Chen on this topic earlier this same year. As a bonus since you're looking at this in August it's more or less correct now, whereas when it was published it had numerous serious errors. Whether the standard library implementation of such an important type should be so complicated that an expert makes numerous errors is another question...
https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=10...
So libc++ gets closest: 1 flag byte + 22 bytes of text + 1 byte of ASCII NUL = 24 bytes.
The others are much worse: larger (32 bytes on modern computers) yet with lower SSO capacity (15 bytes of text).
Huh, I thought MSVC had a libc++-style SSO implementation. Now I have to wonder where I got that mistaken impression :(
> That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload
Note that I didn't say this is impossible, just that the given trick wouldn't work.
However, this is impossible for general strings. The only way you could possibly make this work is if you constrain the inline string somehow (e.g., to UTF-8), so that some shorter strings failing that constraint are forced to go on the heap too. Otherwise you have 1 fixed zero byte at the end, and 23 fully flexible bytes, leaving you no way to represent an out-of-line string.
(Well, you could do it if you use the address as a key into some static map or such where you shove the real data, but that's cheating and beside the point here.)
Think that through again. An inline string needs a fixed 0 byte at the end. A heap string does not. Therefore if the last byte is anything other than 0 you have a heap string.
Inline strings only use about 0.4% of your possible values (1 of the 256 values the last byte can take).
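A rough sketch of that discrimination, assuming 64-bit pointers and a made-up 24-byte layout (the `Repr` type, `kHeapTag`, and the helper functions are hypothetical, not any real library's): the last byte is 0 for an inline string and any nonzero tag for a heap string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 24-byte representation, for illustration only.
struct Repr {
    unsigned char raw[24];
};

static_assert(sizeof(const char*) <= 8, "sketch assumes pointers fit in 8 bytes");

constexpr unsigned char kHeapTag = 0xFF;  // assumption: any nonzero value works

// Inline: up to 23 text bytes, NUL-terminated, last byte always 0.
Repr make_inline(const char* s, std::size_t len) {
    assert(len <= 23);
    Repr r{};                    // zero-filled, so raw[23] == 0
    std::memcpy(r.raw, s, len);
    return r;
}

// Heap: pointer and size in the first 16 bytes, nonzero tag in the last.
// (Capacity could live in raw[16..22]; omitted to keep the sketch short.)
Repr make_heap(const char* data, std::uint64_t size) {
    Repr r{};
    std::memcpy(r.raw, &data, sizeof data);
    std::memcpy(r.raw + 8, &size, sizeof size);
    r.raw[23] = kHeapTag;
    return r;
}

// The whole test: a zero last byte means the text lives in raw itself.
bool is_inline(const Repr& r) { return r.raw[23] == 0; }

const char* c_str(const Repr& r) {
    if (is_inline(r)) return reinterpret_cast<const char*>(r.raw);
    const char* p;
    std::memcpy(&p, r.raw, sizeof p);
    return p;
}
```

If the last byte also carries the remaining inline capacity, inline strings use the values 0 through 23 instead of just 0, and the heap tag has to be something above that.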
Oof, you're right, thank you. In my mind the last byte was obviously zero for a heap string too, since the pointer or sizes would've had a zero upper byte. Somehow I never accounted for the fact that on 64-bit there's no need to represent it that way. Fantastic point!