Comment by jjmarr

21 hours ago

there is no guarantee `char` is 8 bits, nor that it represents text, or even a particular encoding.

If your codebase has those guarantees, go ahead and use it.

> there is no guarantee `char` is 8 bits, nor that it represents text, or even a particular encoding.

True, but sizeof(char) is defined to be 1. In section 7.6.2.5 [1]:

"The result of sizeof applied to any of the narrow character types is 1"

In fact, char and its associated narrow character types are the only types in the standard whose size is not implementation-defined.

So the only way that a C++ implementation can conform to the standard and have a char type that is not 8 bits is if the size of a byte is not 8 bits. There are historical systems that meet that constraint but no modern systems that I am aware of.

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/n49...
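
To make that concrete, the only portable guarantees are the ones above, and both can be checked at compile time:

```cpp
#include <climits>  // CHAR_BIT

static_assert(sizeof(char) == 1, "sizeof of a narrow character type is always 1");
static_assert(CHAR_BIT >= 8, "a byte is at least 8 bits, not exactly 8");
```

Both assertions hold on every conforming implementation; on a 16-bit-char DSP the first still passes, while CHAR_BIT is simply 16.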

  • Don't some modern DSPs still have 32 bits as the minimum addressable unit? Or is that a thing of the past?

    • If you're on such a system, and you write code that uses char, then perhaps you deserve whatever mess that causes you.

char8_t also isn't guaranteed to be 8 bits: it has the same size as unsigned char, so sizeof(char8_t) == 1 just like sizeof(char). On a platform where char is 16 bits, char8_t will be 16 bits as well.

The C++ standard explicitly says that it has the same size, signedness, and alignment as unsigned char, but it's a distinct type. So it's pretty useless, and badly named.
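
That relationship is easy to verify at compile time (C++20 and later, since that's where char8_t was introduced):

```cpp
#include <type_traits>

static_assert(sizeof(char8_t) == sizeof(unsigned char));
static_assert(alignof(char8_t) == alignof(unsigned char));
// Same size and alignment, but still a distinct type:
static_assert(!std::is_same_v<char8_t, unsigned char>);
static_assert(!std::is_same_v<char8_t, char>);
```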

  • Wouldn't it rather be the case that char8_t just wouldn't exist on that platform? At least that's the case with the uintN_t types; they are just not available everywhere. If you want something that is always available you need to use uint_leastN_t or uint_fastN_t.
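
    For comparison, here is the usual dance with the optional exact-width types; `octet` is just an illustrative alias name, and the `UINT8_MAX` macro is specified to be defined exactly when `uint8_t` is provided:

    ```cpp
    #include <cstdint>

    #ifdef UINT8_MAX                   // defined only if std::uint8_t exists
    using octet = std::uint8_t;        // exactly 8 bits, no padding
    #else
    using octet = std::uint_least8_t;  // always available, at least 8 bits
    #endif
    ```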

  • wtf

    • It is pretty consistent: it is part of the C standard and a feature meant to make string handling better, so it would be crazy if it wasn't a complete clusterfuck.

There's no guarantee char8_t is 8 bits either; it's only guaranteed to be at least 8 bits.

  • > There's no guarantee char8_t is 8 bits either; it's only guaranteed to be at least 8 bits.

    Have you read the standard? It says: "The result of sizeof applied to any of the narrow character types is 1." Here, "narrow character types" includes char and char8_t. So technically they aren't guaranteed to be 8 bits, but they are guaranteed to be one byte.

    • Yes, but a byte is not guaranteed to be 8 bits, because on many ancient computers it wasn't.

      The poster you replied to has read the standard correctly.

  • What platforms have char8_t as more than 8 bits?

    • Well, platforms with CHAR_BIT != 8. In C and C++, a char (and therefore a byte) is at least 8 bits, not exactly 8 bits. POSIX does force CHAR_BIT == 8. I think the only place you still see this is embedded, and there only on some DSP- or ASIC-like devices. So in practice most code will break on those platforms, and they are very rare, but they are still technically supported by the C and C++ standards, much as C still supported non-two's-complement architectures until 2023.
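
      The classic guard for code that would rather fail loudly than misbehave on such a platform is a two-liner:

      ```cpp
      #include <climits>

      #if CHAR_BIT != 8
      #error "this code assumes 8-bit bytes"
      #endif
      ```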

How many non-8-bit-char platforms are there with char8_t support, and how many do we expect in the future?

  • Mostly DSPs

    • Is there a single esoteric DSP in active use that supports C++20? This is the umpteenth time I've seen DSPs brought up in casual conversations about C/C++ standards, so I did a little digging:

      Texas Instruments' compiler seems to be celebrating C++14 support: https://www.ti.com/tool/C6000-CGT

      CrossCore Embedded Studio apparently supports C++11 if you pass a switch in requesting it, though this FAQ answer suggests the underlying standard library is still C++03: https://ez.analog.com/dsp/software-and-development-tools/cce...

      Everything I've found CodeWarrior related suggests that it is C++03-only: https://community.nxp.com/pwmxy87654/attachments/pwmxy87654/...

      Aside from that, from what I can tell, those esoteric architectures are being phased out in favor of running DSP workloads on Cortex-M, which is just ARM.

      I'd love it if someone who was more familiar with DSP workloads would chime in, but it really does seem that trying to be the language for all possible and potential architectures might not be the right play for C++ in 202x.

      Besides, it's not like those old standards or compilers are going anywhere.


That's where the standard should come in and say something like "starting with C++26, char is always 8 bits and signed, and std::string is always UTF-8." Done, Unicode in C++ fixed.

But instead we get this mess. I guess it's because there's too much Microsoft in the standard, and they are the only ones that still don't have UTF-8 everywhere in Windows.

  • std::string is not UTF-8 and can't be made UTF-8. It's encoding-agnostic; its API is in terms of bytes, not code points.
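
    A minimal demonstration of the byte-oriented API:

    ```cpp
    #include <cassert>
    #include <string>

    int main() {
        std::string s = "h\xC3\xA9llo";  // "héllo", with é spelled as its two UTF-8 bytes
        assert(s.size() == 6);           // size() counts the 6 bytes, not the 5 code points
    }
    ```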

    • Of course it can be made UTF-8. Just add a codepoints_size() method and other helpers.

      But it isn't really needed anyway: I'm using it for UTF-8 (with helper functions for the 1% of cases where I need code points) and it works fine. Since C++20, though, it's getting annoying, because I have to reinterpret_cast to the useless u8 versions.
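
      A sketch of what such a helper can look like (hypothetical name; it assumes the string holds valid UTF-8). Continuation bytes in UTF-8 all match the bit pattern 10xxxxxx, so counting the bytes that are not continuation bytes counts the code points:

      ```cpp
      #include <cstddef>
      #include <string>

      // Hypothetical helper, not part of the standard library.
      std::size_t codepoint_count(const std::string& utf8) {
          std::size_t n = 0;
          for (unsigned char c : utf8)
              n += (c & 0xC0) != 0x80;  // skip 10xxxxxx continuation bytes
          return n;
      }
      ```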

char on Linux ARM is unsigned, which makes for fun surprises when you've only ever dealt with x86 and assumed char was signed everywhere.

  • This bit us in Chromium. We at least discussed forcing the compiler to use unsigned char on all platforms; I don't recall if that actually happened.
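
    GCC and Clang expose exactly that knob as -fsigned-char / -funsigned-char, which pin the signedness of plain char regardless of the ABI default. The difference is easy to observe:

    ```cpp
    bool char_is_signed() {
        char c = static_cast<char>(0xFF);  // all bits set
        return c < 0;  // true where char is signed (x86 Linux default), false on ARM Linux
    }
    ```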

Related: in C at least (C++ standards are tl;dr), type names like `int32_t` are not required to exist. Most uses, in portable code, should be `int_least32_t`, which is required.
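
For example, in the C++ spelling (the same rules carry over to <cstdint>):

```cpp
#include <cstdint>

std::int_least32_t total = 0;  // required to exist: at least 32 bits
// std::int32_t would be optional: it exists only where some integer type
// has exactly 32 bits with no padding and two's-complement representation.
```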