Comment by jhasse

1 month ago

That's where the standard should come in and say something like "starting with C++26 char is always 1 byte and signed. std::string is always UTF-8" Done, fixed unicode in C++.

But instead we get this mess. I guess it's because there's too much Microsoft in the standard and they are the only ones not having UTF-8 everywhere in Windows yet.

10 comments

jhasse

fluoridation 1 month ago

char is always 1 byte. What it's not always is 1 octet.

jhasse 1 month ago

you're right. What I meant was that it should always be 8 bit, too.

jstimpfle 1 month ago

std::string is not UTF-8 and can't be made UTF-8. It's encoding agnostic, its API is in terms of bytes not codepoints.

jhasse 1 month ago
Of course it can be made UTF-8. Just add a codepoints_size() method and other helpers.
But it isn't really needed anyway: I'm using it for UTF-8 (with helper functions for the 1% cases where I need codepoints) and it works fine. But starting with C++20 it's starting to get annoying because I have to reinterpret_cast to the useless u8 versions.
- jstimpfle 1 month ago
  
  First, because of existing constraints like mutability though direct buffer access, a hypothetical codepoints_size() would require recomputation each time which would be prohibitively expensive, in particular because std::string is virtually unbounded.
  Second, there is also no way to be able to guarantee that a string encodes valid UTF-8, it could just be whatever.
  You can still just use std::string to store valid encoded UTF-8, you just have to be a little bit careful. And functions like codepoints_size() are pretty fringe -- unless you're not doing specialized Unicode transformations, it's more typical to just treat strings as opaque byte slices in a typical C++ application.
  
  5 replies →