Comment by sawyna
7 hours ago
I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?
UTF-8 covers the full Unicode range of 1,114,112 code points (U+0000 through U+10FFFF). Unicode 15.1 (2023, https://www.unicode.org/versions/Unicode15.1.0/) includes a total of 149,813 characters, which covers most of the world's languages, scripts, and emojis. That leaves roughly 960K code points for future expansion.
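For anyone who wants to check the arithmetic, here is a quick sketch. The variable names are mine, and the 149,813 figure is simply the one quoted above for Unicode 15.1.

```python
# Unicode code points run from U+0000 to U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
total_code_points = MAX_CODE_POINT + 1

# Character count quoted above for Unicode 15.1.
assigned_characters = 149_813

remaining = total_code_points - assigned_characters

print(total_code_points)  # 1114112
print(remaining)          # 964299
```

Note that not all of the remainder is actually assignable, since the surrogate and private-use ranges are set aside, so the truly free space is somewhat smaller than this.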
So, it won't fill up during our lifetime I guess.
I wouldn't be too quick to jump to that conclusion, we could easily shove another 960k emojis into the spec!
Nothing is automatic.
If we ever needed that many characters, yes the most obvious solution would be a fifth byte. The standard would need to be explicitly extended though.
But that would probably require having encountered literate extraterrestrial species to collect enough new alphabets to fill up all the available code points first. So seems like it would be a pretty cool problem to have.
UTF-8 is just an encoding of Unicode. It is specified so that it can encode all Unicode code points up to 0x10FFFF, and it doesn't extend further. UTF-16 encodes Unicode in a similar way; it doesn't encode anything more.
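A small sketch of that ceiling, using only Python built-ins: the highest code point takes four bytes in UTF-8 and a surrogate pair in UTF-16, and Python refuses to even construct anything above it. The `rejected` flag is just a name I picked for the demo.

```python
top = chr(0x10FFFF)  # the highest Unicode code point

utf8 = top.encode("utf-8")
utf16 = top.encode("utf-16-be")

print(utf8.hex())   # f48fbfbf (4 bytes, the UTF-8 maximum)
print(utf16.hex())  # dbffdfff (a surrogate pair, the UTF-16 maximum)

# One past the end is not a code point at all.
rejected = False
try:
    chr(0x110000)
except ValueError:
    rejected = True

print(rejected)  # True
```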
So what would need to happen first would be that unicode decides they are going to include larger codepoints. Then UTF-8 would need to be extended to handle encoding them. (But I don't think that will happen.)
It seems like Unicode code points are less than 30% allocated, roughly. So there's about 70% free space.
---
Think of these three separate concepts to make it clear. We are effectively dealing with two translations: one from the abstract symbol to a defined Unicode code point, and then one from that code point to bytes, using UTF-8.
1. The glyph or symbol ("A")
2. The unicode code point for the symbol (U+0041 Latin Capital Letter A)
3. The utf-8 encoding of the code point, as bytes (0x41)
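The three steps above can be sketched in a few lines of Python. The euro sign is my own addition, included only to show a case where the code point and the bytes differ.

```python
symbol = "A"                      # 1. the symbol
code_point = ord(symbol)          # 2. its Unicode code point
encoded = symbol.encode("utf-8")  # 3. the UTF-8 bytes for that code point

print(f"U+{code_point:04X}")  # U+0041
print(encoded.hex())          # 41

# Outside ASCII the code point and the encoded bytes no longer look alike.
euro = "€"
euro_code_point = ord(euro)
euro_encoded = euro.encode("utf-8")

print(f"U+{euro_code_point:04X}")  # U+20AC
print(euro_encoded.hex())          # e282ac
```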
As an aside: UTF-8, as specified in RFC 2279 (1998), could encode code points up to U+7FFFFFFF (using sequences of up to six bytes). It was later restricted to U+10FFFF (by RFC 3629, in 2003) to ensure compatibility with UTF-16.
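To make the six-byte scheme concrete, here is a rough sketch of an encoder for the older, wider range. `encode_rfc2279` is a hypothetical helper I wrote for illustration, not something from a library, and it skips checks a real encoder would need (it does not reject surrogates, for example).

```python
def encode_rfc2279(cp: int) -> bytes:
    """Encode cp with the older 1-to-6-byte UTF-8 scheme (illustrative only)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range for the six-byte scheme")
    if cp < 0x80:
        return bytes([cp])
    # (largest value that fits, sequence length, lead-byte prefix)
    for limit, length, prefix in (
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ):
        if cp <= limit:
            break
    # Peel off 6 bits at a time for the continuation bytes (10xxxxxx).
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([prefix | cp] + tail[::-1])


print(encode_rfc2279(0x10FFFF).hex())    # f48fbfbf, same as today's UTF-8
print(encode_rfc2279(0x7FFFFFFF).hex())  # fdbfbfbfbfbf, the old six-byte maximum
```

Up to U+10FFFF the output matches what a modern encoder produces; the five- and six-byte forms are exactly the part that was dropped.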