Comment by jhasse
13 hours ago
That's where the standard should come in and say something like "starting with C++26 char is always 1 byte and signed. std::string is always UTF-8." Done, Unicode in C++ fixed.
But instead we get this mess. I guess it's because there's too much Microsoft in the standard, and they are the only ones who don't have UTF-8 everywhere in Windows yet.
char is always 1 byte. What it's not always is 1 octet.
You're right. What I meant was that it should always be 8 bits, too.
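To make the byte vs. octet distinction concrete, here is a minimal sketch. `sizeof(char)` and `CHAR_BIT` are standard; the helper name `bits_per_char` is made up for illustration, and the second `static_assert` is just one way a codebase might insist on 8-bit bytes.

```cpp
#include <climits>  // CHAR_BIT

// sizeof(char) is 1 by definition, on every platform.
static_assert(sizeof(char) == 1, "char is always exactly one byte");

// What can vary is how many bits that byte has. Current standards
// only guarantee CHAR_BIT >= 8. This guard is an assumption for the
// example: it refuses to build where a byte is not an octet.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Hypothetical helper: width of a char in bits on this platform.
constexpr int bits_per_char() { return CHAR_BIT; }
```

On mainstream desktop and server platforms both assertions pass; the second one would only fire on unusual targets with wider bytes.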
std::string is not UTF-8 and can't be made UTF-8. It's encoding-agnostic; its API is in terms of bytes, not code points.
Of course it can be made UTF-8. Just add a codepoints_size() method and other helpers.
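A rough sketch of what such a helper could look like as a free function. `codepoints_size` is not part of the standard library; the name is borrowed from the comment above. It assumes the input is already well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the proposed codepoints_size() member.
// Assumes s holds well-formed UTF-8; malformed input is not detected.
constexpr std::size_t codepoints_size(std::string_view s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx. Every other byte
        // starts a new code point, so count only those.
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, `codepoints_size("h\xC3\xA9llo")` is 5 while `size()` on the same string is 6, because "é" takes two bytes. Note this counts code points, not user-perceived characters (grapheme clusters).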
But it isn't really needed anyway: I'm using it for UTF-8 (with helper functions for the 1% of cases where I need code points) and it works fine. But since C++20 it's getting annoying because I have to reinterpret_cast to the useless u8 versions.