
Comment by dataflow

1 year ago

Note for C++ developers: Their trick is only possible because the strings are UTF-8 and not null-terminated. It wouldn't work as a drop-in for standard strings in C++.

Null-terminated strings really are a mistake. They make vectorized algorithms problematic by forcing them to account for page size and paged memory in general, as well as to always scan for NUL. They cannot be easily sliced without re-allocation, they are the opposite of how languages with better string primitives define strings, and in general they don't save much by passing a single pointer instead of ptr + length.

  • It's not really possible to get rid of them in C++, however, given the staggering number of legacy APIs that require them. Converting every time you have to call a system API with your string is even worse.

    • FWIW Rust also has null-terminated strings available under its std::ffi module. I’m not sure it would be feasible to migrate C++ to the approach of multiple string types now, and I’m not sure it would have been the right approach for C++11 given C++’s approach to interop, but it’s definitely possible to support interoperability with those legacy APIs without constraining a default string type to null-termination.
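To illustrate the slicing point from the parent comment, here is a minimal C++17 sketch. The helper names `first_n_view` and `first_n_copy` are made up for this example; only `std::string_view` and `std::string` are real library types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Pointer + length: a prefix slice is just a shorter view of the same bytes.
// (first_n_view is a hypothetical helper name.)
std::string_view first_n_view(std::string_view s, std::size_t n) {
    return s.substr(0, n);  // no copy; substr clamps n to the view's size
}

// Null-terminated: the slice needs its own NUL right after the last byte,
// so the bytes have to be copied into a separate buffer.
// (first_n_copy is a hypothetical helper name.)
std::string first_n_copy(const char* s, std::size_t n) {
    return std::string(s, std::min(n, std::strlen(s)));
}
```

The view returned by `first_n_view` still points into the original buffer, while `first_n_copy` has to produce new storage just to have somewhere to put the terminator.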

std::string isn't null terminated (or at least it isn't guaranteed to be, I don't think it's forbidden for an implementation to do that).

That's why the c_str method exists: so you can get a pointer to a null-terminated character array.

  • > std::string isn't null terminated

    It is as of C++11. The constness of c_str() threw a wrench into that as soon as C++ got a threading model.

  • std::string is guaranteed to be null terminated since C++11.

    std::string::c_str returns the same address as &string[0]

    • > std::string::c_str returns the same address as &string[0]

      Note that this by itself doesn't imply null-termination, though as you say strings are indeed null-terminated now.

      Edit: This is not particularly obvious, but the reason this has nothing to do with allocation or the return value is that the implementation could still leave space for the null terminator, but avoid actually setting it to zero until c_str() is invoked. That would neither affect the returned pointer nor the constant-time guarantee.


  • I think std::string is actually required to be null terminated now in the latest standards. But even before it was basically required as that was the only way to make c_str() constant time.

    • > But even before it was basically required as that was the only way to make c_str() constant time.

      Note that c_str() could've inserted the terminator in constant time, as long as the string kept space reserved for that. So this wouldn't have violated constant-time-ness, and it's not an insane thing to do considering it would save some instructions elsewhere. But yeah, the flexibility wasn't all that useful even in the beginning, and became even more useless as soon as C++ incorporated threading, due to the constness of the function.
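For anyone who wants to check the post-C++11 behaviour discussed in this subthread, a small sketch follows. The function name `terminator_guarantees_hold` is made up here; the properties it checks are the ones the replies describe (c_str() and data() share the buffer with &s[0], and the element at index size() is the terminator).

```cpp
#include <cassert>
#include <string>

// Checks the properties discussed above for a given string.
// (terminator_guarantees_hold is a hypothetical name for this example.)
bool terminator_guarantees_hold(const std::string& s) {
    return s.c_str() == s.data()   // both return the same buffer
        && s.c_str() == &s[0]      // which is also where element 0 lives
        && s[s.size()] == '\0';    // reading index size() yields the terminator
}
```

Note that the last check reads the terminator without calling c_str() first, which is exactly what a "write the NUL lazily inside c_str()" implementation could not support.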

More of a knock against C than C++, seeing as today’s C++ tends to prefer `string_view`s or at worst iterator pairs, neither of which use null-termination. (Saying that as someone who prefers C to C++ and avoids Rust exactly because of “better C++” vibes—I don’t want a better version of C++, if anything I want a much much simpler one that’s worse at some things.)

That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload (so 23 useful ones, the best you could hope for with null termination) into the same structure the same way by storing the compact version with null termination and ensuring the last byte is also always zero. For extra style points, define the last byte to be 24 minus payload length instead so you don’t need to recompute the length in the inline case.

To be clear: none of this makes null termination not dumb.

  • > That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload (so 23 useful ones, the best you could hope for with null termination) into the same structure the same way by storing the compact version with null termination and ensuring the last byte is also always zero.

    I want to say libc++ and maybe MSVC do something along those lines in their std::string implementations.

    > For extra style points, define the last byte to be 24 minus payload length instead so you don’t need to recompute the length in the inline case.

    IIRC Facebook's FBString from Folly does (did?) that?

    • > I want to say libc++ and maybe MSVC do something along those lines in their std::string implementations.

      Here's Raymond Chen on this topic earlier this same year. As a bonus since you're looking at this in August it's more or less correct now, whereas when it was published it had numerous serious errors. Whether the standard library implementation of such an important type should be so complicated that an expert makes numerous errors is another question...

      https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=10...

      So, libc++ gets closest, 1 flag byte + 22 bytes of text + 1 byte of ASCII NUL = 24 bytes

      The others are much worse, larger (32 bytes on modern computers) yet with lower SSO capacity (15 bytes of text).
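The "last byte doubles as spare capacity" idea quoted above can be sketched roughly as follows. This is an inline-only toy, not the layout of FBString or of any standard library; `InlineStr` and its members are hypothetical names. It stores 23 minus the payload length in the final byte, so a full 23-byte payload leaves that byte at zero and it serves as the terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical inline-only string: 24 bytes total, up to 23 payload bytes.
struct InlineStr {
    static constexpr std::size_t kCapacity = 23;
    unsigned char bytes[24] = {};

    // An empty string has all 23 payload bytes spare.
    InlineStr() { bytes[23] = static_cast<unsigned char>(kCapacity); }

    // Copies len bytes in; returns false (leaving *this unchanged) if they don't fit.
    bool assign(const char* src, std::size_t len) {
        if (len > kCapacity) return false;
        std::memcpy(bytes, src, len);
        bytes[len] = 0;  // terminator directly after the payload
        bytes[23] = static_cast<unsigned char>(kCapacity - len);  // spare capacity
        return true;
    }

    // Length comes from the last byte, with no scan for NUL.
    std::size_t size() const { return kCapacity - bytes[23]; }

    const char* c_str() const { return reinterpret_cast<const char*>(bytes); }
};

static_assert(sizeof(InlineStr) == 24, "payload and bookkeeping share 24 bytes");
```

When the payload is exactly 23 bytes, the two writes in `assign` hit the same byte and both store zero, which is what lets the length field and the terminator share storage.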


  • > That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload

    Note that I didn't say this is impossible, just that the given trick wouldn't work.

    However, this is impossible for general strings. The only way you could possibly make this work is if you constrain the inline string somehow (e.g., to UTF-8), so that some shorter strings failing that constraint are forced to go on the heap too. Otherwise you have 1 fixed zero byte at the end, and 23 fully flexible bytes, leaving you no way to represent an out-of-line string.

    (Well, you could do it if you use the address as a key into some static map or such where you shove the real data, but that's cheating and beside the point here.)

    • Think that through again. An inline string needs a fixed 0 byte at the end. A heap string does not. Therefore if the last byte is anything other than 0 you have a heap string.

      Inline strings only use 0.4% of your possible values.
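A rough sketch of that discrimination rule, under stated assumptions: pointers are at most 8 bytes, the out-of-line form keeps its pointer in bytes 0..7 and its length in bytes 8..15, and byte 23 is a tag. `Repr`, `kHeapTag`, and the helper functions are all hypothetical names, and ownership of the out-of-line buffer is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Repr { unsigned char bytes[24]; };

static_assert(sizeof(const char*) <= 8, "sketch assumes pointers fit in 8 bytes");

constexpr unsigned char kHeapTag = 0x01;  // any nonzero value would do

// Inline strings always end in a zero byte; anything else is out-of-line.
inline bool is_inline(const Repr& r) { return r.bytes[23] == 0; }

// Inline form: payload followed by zeros. Requires strlen(s) <= 23.
inline Repr make_inline(const char* s) {
    Repr r{};  // all zero, so byte 23 is the fixed terminator
    std::size_t n = std::strlen(s);
    assert(n <= 23);
    std::memcpy(r.bytes, s, n);
    return r;
}

// Out-of-line form: pointer + length, with a nonzero tag in the last byte.
inline Repr make_heap(const char* data, std::uint64_t len) {
    Repr r{};
    std::memcpy(r.bytes, &data, sizeof data);
    std::memcpy(r.bytes + 8, &len, sizeof len);
    r.bytes[23] = kHeapTag;
    return r;
}

// Reads the stored pointer back out of an out-of-line Repr.
inline const char* heap_data(const Repr& r) {
    const char* p = nullptr;
    std::memcpy(&p, r.bytes, sizeof p);
    return p;
}

// One of the 256 possible last-byte values is reserved for inline strings.
constexpr double kInlineFraction = 1.0 / 256.0;  // 0.00390625, i.e. about 0.4%
```

The 0.4% figure is just 1/256: only the bit patterns whose last byte is zero are inline strings, and the other 255 values of that byte remain available to mark out-of-line ones.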
