Comment by ChrisSD

16 days ago

`char8_t` is probably one of the more baffling blunders of the standards committee.


jjmarr 16 days ago

there is no guarantee `char` is 8 bits, nor that it represents text, or even a particular encoding.

If your codebase has those guarantees, go ahead and use it.

hackyhacky 15 days ago
> there is no guarantee `char` is 8 bits, nor that it represents text, or even a particular encoding.
True, but sizeof(char) is defined to be 1. In section 7.6.2.5 [1]:
"The result of sizeof applied to any of the narrow character types is 1"
In fact, char and associated types are the only types in the standard where the size is not implementation-defined.
So the only way that a C++ implementation can conform to the standard and have a char type that is not 8 bits is if the size of a byte is not 8 bits. There are historical systems that meet that constraint but no modern systems that I am aware of.
[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/n49...
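The distinction between "one byte" and "eight bits" can be checked at compile time. This is a minimal sketch, not taken from the thread: `bits_in` is a hypothetical helper name, and the commented-out guard is only one possible way for a codebase to state its 8-bit assumption.

```cpp
#include <cassert>
#include <climits>  // CHAR_BIT

// sizeof counts in units of char, so these hold on every conforming implementation.
static_assert(sizeof(char) == 1);
static_assert(sizeof(signed char) == 1 && sizeof(unsigned char) == 1);

// The number of bits in that one byte is only required to be at least 8.
static_assert(CHAR_BIT >= 8);

// Hypothetical helper: number of bits in the object representation of T.
template <typename T>
constexpr int bits_in() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// A codebase that relies on octets could state that assumption explicitly:
// static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
```

On mainstream desktop and server targets `bits_in<char>()` is 8; on a word-addressed target it would presumably report the word size instead.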
- int_19h 14 days ago
  
  That would be any CPU with word-addressing only. Which, granted, is very exotic today, but they do still exist: https://www.analog.com/en/products/adsp1802.html
- gpderetta 15 days ago
  
  Don't some modern DSPs still have 32 bits as the minimum addressable unit of memory? Or is that a thing of the past?
  
20k 16 days ago
char8_t also isn't guaranteed to be 8 bits, because sizeof(char) == 1 and sizeof(char8_t) == 1. On a platform where char is 16 bits, char8_t will be 16 bits as well.
The C++ standard explicitly says that it has the same size, signedness and alignment as unsigned char, but it's a distinct type. So it's pretty useless, and badly named.
- 1718627440 15 days ago
  
  Wouldn't it be rather the case that char8_t just wouldn't exist on that platform? At least that's the case with the uintN_t types, they are just not available everywhere. If you want something that is always available you need to use uint_leastN_t or uint_fastN_t.
- jjmarr 15 days ago
  
  wtf
  
Maxatar 16 days ago
There's no guarantee char8_t is 8 bits either, it's only guaranteed to be at least 8 bits.
- hackyhacky 15 days ago
  
  > There's no guarantee char8_t is 8 bits either, it's only guaranteed to be at least 8 bits.
  Have you read the standard? It says: "The result of sizeof applied to any of the narrow character types is 1." Here, "narrow character types" includes both char and char8_t. So technically they aren't guaranteed to be 8 bits, but they are guaranteed to be one byte.
  
- CyberDildonics 15 days ago
  
  What platforms have char8_t as more than 8 bits?
  
jhasse 15 days ago
That's where the standard should come in and say something like "starting with C++26 char is always 1 byte and signed. std::string is always UTF-8." Done, fixed Unicode in C++.
But instead we get this mess. I guess it's because there's too much Microsoft in the standard and they are the only ones not having UTF-8 everywhere in Windows yet.
- fluoridation 15 days ago
  
  char is always 1 byte. What it's not always is 1 octet.
  
- jstimpfle 15 days ago
  
  std::string is not UTF-8 and can't be made UTF-8. It's encoding agnostic, its API is in terms of bytes not codepoints.
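To make the "bytes, not code points" point concrete, here is a small sketch. `count_code_points` and `sample` are hypothetical helpers, not standard library functions, and the counting logic assumes the string already holds valid UTF-8; it does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points, ASSUMING `s` holds valid UTF-8.
// Continuation bytes have the form 10xxxxxx; every other byte starts a code point.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0u) != 0x80u) {
            ++n;
        }
    }
    return n;
}

// "naive" with a diaeresis on the i (U+00EF), written as explicit UTF-8 bytes
// so the encoding of the source file does not matter.
inline std::string sample() {
    return std::string("na") + "\xC3\xAF" + "ve";
}
```

For this sample, `size()` reports 6 because it counts bytes, while the helper reports 5 code points. Indexing, `substr` and `find` likewise work on byte offsets, so they can split a multi-byte sequence.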
  
dataflow 16 days ago
How many non-8-bit-char platforms are there with char8_t support, and how many do we expect in the future?
- RobotToaster 16 days ago
  
  Mostly DSPs
  
- dspwizard 15 days ago
  
  TI C2000 is one example
  
Asmod4n 15 days ago
char on Linux on ARM is unsigned, which makes for fun surprises when you have only ever dealt with x86 and assumed char to be signed everywhere.
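A rough sketch of how code can detect the signedness of plain `char` and avoid depending on it. `char_is_signed` and `byte_value` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <climits>      // CHAR_MIN
#include <type_traits>

// Whether plain char is signed is implementation-defined; both views must agree.
constexpr bool char_is_signed = std::is_signed_v<char>;
static_assert(char_is_signed == (CHAR_MIN < 0));

// Hypothetical helper: go through unsigned char to get a non-negative byte value
// regardless of the platform's choice for plain char.
constexpr int byte_value(char c) {
    return static_cast<unsigned char>(c);
}

static_assert(byte_value('A') == 65);
static_assert(byte_value('\xFF') == 0xFF);
```

Comparing `c < 0` or using a raw `char` as an array index is where the x86-versus-ARM difference usually shows up. GCC and Clang also accept `-fsigned-char` and `-funsigned-char` to force one behavior across targets.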
- pkasting 15 days ago
  
  This bit us in Chromium. We at least discussed forcing the compiler to use unsigned char on all platforms; I don't recall if that actually happened.
  
kps 15 days ago

Related: in C at least (C++ standards are tl;dr), type names like `int32_t` are not required to exist. Most uses, in portable code, should be `int_least32_t`, which is required.
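The same applies in C++ through `<cstdint>`. A minimal sketch follows: `counter_t` and `has_exact_int32` are hypothetical names, and the `INT32_MAX` check relies on the limit macro being defined only when the exact-width type is provided.

```cpp
#include <cassert>
#include <climits>  // CHAR_BIT
#include <cstdint>

// int_least32_t is required to exist and to hold at least 32 bits.
using counter_t = std::int_least32_t;
static_assert(sizeof(counter_t) * CHAR_BIT >= 32);
static_assert(INT_LEAST32_MAX >= 2147483647);

// int32_t is optional; its limit macro doubles as a feature test.
#ifdef INT32_MAX
constexpr bool has_exact_int32 = true;
#else
constexpr bool has_exact_int32 = false;
#endif
```

Portable code can use `counter_t` unconditionally and reserve `std::int32_t` for places that need exactly 32 bits, such as a wire format, guarded by a check like the one above.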