Comment by jstimpfle

14 days ago

std::string is not UTF-8 and can't be made UTF-8. It's encoding agnostic, its API is in terms of bytes not codepoints.

Of course it can be made UTF-8. Just add a codepoints_size() method and other helpers.

But it isn't really needed anyway: I'm using it for UTF-8 (with helper functions for the 1% cases where I need codepoints) and it works fine. But starting with C++20 it's starting to get annoying because I have to reinterpret_cast to the useless u8 versions.

  • First, because of existing constraints like mutability though direct buffer access, a hypothetical codepoints_size() would require recomputation each time which would be prohibitively expensive, in particular because std::string is virtually unbounded.

    Second, there is also no way to be able to guarantee that a string encodes valid UTF-8, it could just be whatever.

    You can still just use std::string to store valid encoded UTF-8, you just have to be a little bit careful. And functions like codepoints_size() are pretty fringe -- unless you're not doing specialized Unicode transformations, it's more typical to just treat strings as opaque byte slices in a typical C++ application.