Comment by vitaut
1 day ago
Somewhat notable is that `char8_t` is banned with very reasonable motivation that applies to most codebases:
> Use char and unprefixed character literals. Non-UTF-8 encodings are rare enough in Chromium that the value of distinguishing them at the type level is low, and char8_t* is not interconvertible with char* (what ~all Chromium, STL, and platform-specific APIs use), so using u8 prefixes would obligate us to insert casts everywhere. If you want to declare at a type level that a block of data is string-like and not an arbitrary binary blob, prefer std::string[_view] over char*.
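A minimal sketch of the cast problem the guide describes, assuming C++20 `u8` literal semantics (the function name is illustrative):

```cpp
// u8 literals yield const char8_t*, which no longer converts to const char*.
#include <cstdio>
#include <string>

void take_c_string(const char* s) { std::puts(s); }  // the shape of ~all C/platform APIs

int main() {
    take_c_string("hello");                                   // fine: plain literal is const char*
    // take_c_string(u8"hello");                              // error: no conversion from const char8_t*
    take_c_string(reinterpret_cast<const char*>(u8"hello"));  // the cast "everywhere"
    std::string s = "hello";                                  // fine
    // std::string t = u8"hello";                             // error since C++20
}
```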
`char8_t` is probably one of the more baffling blunders of the standards committee.
There is no guarantee that `char` is 8 bits, that it represents text, or that it uses any particular encoding.
If your codebase has those guarantees, go ahead and use it.
> There is no guarantee that `char` is 8 bits, that it represents text, or that it uses any particular encoding.
True, but `sizeof(char)` is defined to be 1. From section 7.6.2.5 [1]:
> The result of sizeof applied to any of the narrow character types is 1
In fact, the narrow character types are the only types in the standard whose size is not implementation-defined.
So the only way a C++ implementation can conform to the standard and have a `char` type that is not 8 bits is if a byte on that platform is not 8 bits. There are historical systems that meet that constraint, but no modern ones that I am aware of.
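A compile-time restatement of those two guarantees, using only standard headers:

```cpp
// sizeof(char) is 1 by definition; the number of bits per byte is CHAR_BIT, >= 8.
#include <climits>

static_assert(sizeof(char) == 1, "always true, by definition");
static_assert(CHAR_BIT >= 8, "always true; exactly 8 on all mainstream platforms");
```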
[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/n49...
`char8_t` also isn't guaranteed to be 8 bits: `sizeof(char8_t)` equals `sizeof(unsigned char)`, which is 1, and a byte need not be 8 bits. On a platform where `char` is 16 bits, `char8_t` will be 16 bits as well.
The C++ standard explicitly says that it has the same size, signedness, and alignment as `unsigned char`, but it's a distinct type. So it's pretty useless, and badly named.
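Those guarantees can be written down as `static_assert`s (C++20):

```cpp
// char8_t: same size and alignment as unsigned char, but a distinct type.
#include <type_traits>

static_assert(sizeof(char8_t) == sizeof(unsigned char));
static_assert(alignof(char8_t) == alignof(unsigned char));
static_assert(std::is_unsigned_v<char8_t>);
static_assert(!std::is_same_v<char8_t, unsigned char>);  // distinct type: no implicit interconversion
static_assert(!std::is_same_v<char8_t, char>);
```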
There's no guarantee `char8_t` is 8 bits either; it's only guaranteed to be at least 8 bits.
How many non-8-bit-char platforms are there with char8_t support, and how many do we expect in the future?
That's where the standard should step in and say something like "starting with C++26, char is always 8 bits and signed, and std::string is always UTF-8." Done, Unicode in C++ fixed.
But instead we get this mess. I guess it's because there's too much Microsoft in the standard, and they are the only ones that still don't have UTF-8 everywhere in Windows.
`char` on Linux/ARM is unsigned, which makes for fun surprises when you've only ever dealt with x86 and assumed `char` was signed everywhere.
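A minimal sketch of the surprise; the default signedness of `char` is an ABI choice:

```cpp
// Prints different lines on typical x86-64 Linux (char is signed)
// and AArch64/ARM Linux (char is unsigned).
#include <cstdio>

int main() {
    char c = '\xFF';
    if (c < 0)
        std::puts("char is signed here (e.g. x86 Linux)");
    else
        std::puts("char is unsigned here (e.g. ARM Linux)");
}
```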
Related: in C at least (the C++ standard is tl;dr), type names like `int32_t` are not required to exist. Most uses in portable code should be `int_least32_t`, which is required.
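The same holds for C++'s `<cstdint>`; a small illustration:

```cpp
// int_least32_t must exist; int32_t is optional and may be absent on
// targets that lack an exact 32-bit integer type.
#include <climits>
#include <cstdint>

std::int_least32_t counter = 0;  // guaranteed to exist, at least 32 bits wide
// std::int32_t exact = 0;       // optional: exotic targets may not provide it
static_assert(sizeof(std::int_least32_t) * CHAR_BIT >= 32, "width is at least 32 bits");
```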
Isn't the real reason to use `char8_t` over `char` that `char8_t*` is subject to the same strict aliasing rules as all other non-char primitive types? (I.e., the compiler doesn't have to worry that a `char8_t*` could point to any random piece of memory, as it would for a `char*`.)
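A sketch of the optimization that question is pointing at, under the usual aliasing rules (function names illustrative):

```cpp
// Because char* may alias any object, the compiler must assume the stores
// can modify *n; char8_t* cannot alias an int, so *n may be hoisted.
void zero_chars(char* dst, const int* n) {
    for (int i = 0; i < *n; ++i)  // *n reloaded every iteration:
        dst[i] = 0;               // dst might alias *n
}

void zero_char8s(char8_t* dst, const int* n) {
    for (int i = 0; i < *n; ++i)  // *n can be loaded once:
        dst[i] = 0;               // char8_t* cannot legally alias an int
}
```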
At least in Chromium that wouldn't help us, because we disable strict aliasing, and have to: there are at least a few core places where we violate it, and porting to an alternative looks challenging. For example, some of our core string-handling APIs presume that wchar_t* and char16_t* are actually interconvertible on Windows; those would have to begin memcpying, which rules out certain API shapes and adds a perf cost to the rest.
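A minimal sketch of that memcpy-based alternative, assuming the Windows case where `sizeof(wchar_t) == sizeof(char16_t)` (the function name is hypothetical, not a Chromium API):

```cpp
#include <cstring>
#include <string>

// Conforming alternative to reinterpret_cast<const char16_t*>(w):
// copy the bytes instead of aliasing them.
std::u16string to_u16(const wchar_t* w, std::size_t len) {
    std::u16string out(len, u'\0');
    std::memcpy(out.data(), w, len * sizeof(char16_t));  // assumes 16-bit wchar_t (Windows)
    return out;  // an owning string: the API-shape change and copy cost noted above
}
```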
> using u8 prefixes would obligate us to insert casts everywhere.
Unfortunately, those casts are only safe in one direction: char8_t data may be read through a char* (the char/unsigned char/std::byte aliasing exception covers it), but casting a char* to char8_t* and then accessing the data through the char8_t* pointer is undefined behavior.
Yes, reading the actual data through a char8_t* would still be UB. Hopefully this will be fixed in C++29: https://github.com/cplusplus/papers/issues/592
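A sketch of the asymmetry under the current (C++20/23) rules; the undefined direction is commented out:

```cpp
int main() {
    const char8_t* u8p = u8"hi";
    const char* cp = reinterpret_cast<const char*>(u8p);
    char c = cp[0];  // defined: char glvalues may read any object
    (void)c;

    const char* raw = "hi";
    const char8_t* bad = reinterpret_cast<const char8_t*>(raw);
    // char8_t c8 = bad[0];  // undefined: char8_t is not an allowed aliasing type
    (void)bad;
}
```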