Comment by jstimpfle
24 days ago
First, because of existing constraints like mutability through direct buffer access, a hypothetical codepoints_size() could not cache its result and would have to recompute it on every call. That would be prohibitively expensive, in particular because the size of a std::string is virtually unbounded.
Second, there is no way to guarantee that a string holds valid UTF-8; it could contain arbitrary bytes.
You can still just use std::string to store valid encoded UTF-8; you just have to be a little bit careful. And functions like codepoints_size() are pretty fringe -- unless you're doing specialized Unicode transformations, it's more typical to just treat strings as opaque byte slices in a typical C++ application.
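To make the cost concrete, here is a minimal sketch of what such a function would have to do. codepoints_size() is a hypothetical free function here, not anything in the standard library, and it assumes the string already holds valid UTF-8. It has to walk every byte, so each call is O(n) in the byte length:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the standard library.
// Counts code points by counting every byte that is NOT a UTF-8
// continuation byte (bit pattern 10xxxxxx). Assumes `s` holds valid
// UTF-8; for arbitrary bytes the result is not meaningful.
std::size_t codepoints_size(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return count;
}
```

Since anyone can write through data() or operator[] at any time, the result can't be cached safely, which is why the scan would have to be repeated on every call.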
Perfect is the enemy of good. Or do you think the current mess is better?
std::string _cannot_ be made "always UTF-8". Is that really so contentious?
You can still use it to contain UTF-8 data. It is commonly done.
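As a rough illustration of the "be careful" part: since std::string accepts any bytes, code that needs valid UTF-8 has to check for itself. The sketch below uses a hypothetical is_valid_utf8() helper (not a standard function) that rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates, and values above U+10FFFF:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the standard library.
// Returns true if `s` is a well-formed UTF-8 byte sequence.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 2, 3, or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;

        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence truncated at end of string

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate

        i += len;
    }
    return true;
}
```

A check like this only holds for the moment it runs; nothing stops later code from writing invalid bytes into the same string.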
I never said always. Just add some new methods that require the contents to be UTF-8. All current functions that need an encoding (e.g. text I/O) would also switch to UTF-8. Of course, you could still store arbitrary binary data in it.