Having the continuation bytes always start with the bits `10` also makes it possible to seek to any random byte and trivially know whether you're at the beginning of a character or at a continuation byte, like you mentioned, so you can easily find the beginning of the next or previous character.
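A minimal Python sketch of that check (hypothetical helper names):

    def is_continuation(b):
        # Continuation bytes always look like 0b10xxxxxx.
        return b & 0xC0 == 0x80

    def char_start(buf, i):
        # Walk back (at most 3 steps for valid UTF-8) past continuation bytes.
        while i > 0 and is_continuation(buf[i]):
            i -= 1
        return i

    buf = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
    print(char_start(buf, 2))       # index 2 is the 0xA9 continuation byte -> 1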
If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.
Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.
Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.
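You can see the effect with sys.getsizeof; exact byte counts vary across CPython versions, but the jump in per-character storage is the point:

    import sys

    print(sys.getsizeof("a" * 100))                 # 1 byte per character (Latin-1 range)
    print(sys.getsizeof("a" * 99 + "\u20ac"))       # 2 bytes per character (BMP, non-Latin-1)
    print(sys.getsizeof("a" * 99 + "\U0001f600"))   # 4 bytes per character once an emoji appears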
I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated.
That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.
PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both the size and the maximum code point are known, which covers most cases where the string comes from a literal or a bytes.decode() call. It cut memory usage in typical Django applications by two-thirds when it was implemented.
I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use cases involve chopping off a small prefix (e.g. "hex_digits[2:]") or suffix (e.g. "filename[-3:]"), and you can easily just linear-search these with minimal CPU penalty. Or they're part of library methods where you want your own custom traversals, e.g. .find(substr) can just do Boyer-Moore over bytes, and .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.
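The byte-oriented approach works because a valid UTF-8 needle can never match starting in the middle of a character; a quick Python illustration (assuming the string is kept as UTF-8 bytes):

    text = "naïve café".encode("utf-8")
    needle = "café".encode("utf-8")
    print(text.find(needle))   # 7: a byte offset (the character index would be 6)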
This is Python; finding new ways to subscript into things directly is a graduate student’s favorite pastime!
In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.
Indexing into a Unicode string is a highly unusual operation that is rarely needed. A string is Unicode because it is provided by the user or is a localized user-facing string. You don't generally need indices.
Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.
Your solution is basically what Swift does. Plus they do the same with extended grapheme clusters (what a human would consider distinct characters mostly), and that’s the default character type instead of Unicode code point. Easily the best Unicode string support of any programming language.
Variable width encodings like UTF-8 and UTF-16 cannot be indexed in O(1), only in O(N). But this is not really a problem! Instead of indexing strings we need to slice them, and generally we read them forwards, so if slices (and slices of slices) are cheap, then you can parse textual data without a problem. Basically just keep the indices small and there's no problem.
> If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated.
What conversion rule do you want to use, though? You either reject some values outright, bump those up or down, or else start with a character index that requires an O(N) translation to a byte index.
"Unicode" aka "wide characters" is the dumbest engineering debacle of the century.
> ascii and codepage encodings are legacy, let's standardize on another forwards-incompatible standard that will be obsolete in five years
> oh, and we also need to upgrade all our infrastructure for this obsolete-by-design standard because we're now keeping it forever
VLQ/LEB128 are a bit better than EBML's variable-size integers. You test the MSB in each byte - `0` means it's the end of a sequence and the next byte starts a new sequence. If the MSB is `1`, to find the start of the sequence you walk back until you find the first zero MSB at the end of the previous sequence (or the start of the stream). There are efficient SIMD-optimized implementations of this.
The difference between VLQ and LEB128 is endianness, basically whether the zero MSB is the start or end of a sequence.
It's not self-synchronizing like UTF-8, but it's more compact - any unicode codepoint can fit into 3 bytes (which can encode up to 0x1FFFFF), and ASCII characters remain 1 byte. Can grow to arbitrary sizes. It has a fixed overhead of 1/8, whereas UTF-8 only has overhead of 1/8 for ASCII and 1/3 thereafter. Could be useful compressing the size of code that uses non-ASCII, since most of the mathematical symbols/arrows are < U+3FFF. Also languages like Japanese, since Katakana and Hiragana are also < U+3FFF, and could be encoded in 2 bytes rather than 3.
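For reference, a minimal Python sketch of unsigned LEB128 (low 7 bits first, MSB as the continuation flag); the per-byte continuation test is also what makes decoding branchy:

    def leb128_encode(n):
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)   # MSB=1: more bytes follow
            else:
                out.append(byte)          # MSB=0: end of sequence
                return bytes(out)

    def leb128_decode(buf):
        value, shift = 0, 0
        for byte in buf:                  # one data-dependent branch per byte
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        return value

    assert len(leb128_encode(0x1FFFFF)) == 3          # any codepoint fits in 3 bytes
    assert leb128_decode(leb128_encode(0x1FFFFF)) == 0x1FFFFF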
Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...
The full value is stored little endian, so you simply read the first byte (low byte) in the stream to get the full length, and it has the exact same compactness of VLQ/LEB128 (7 bits per byte).
Even better: modern chips have instructions that decode this length field in one shot (callable via a compiler builtin).
After running this builtin, you simply re-read the memory location for the specified number of bytes, then cast to a little-endian integer, then shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56 bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).
If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.
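A rough Python sketch of my reading of that little-endian scheme (hypothetical helper, assuming the trailing-zero count of the first byte gives the number of extra bytes, and the whole field is shifted right to drop the unary length marker):

    def decode_length_prefixed(buf):
        first = buf[0]
        if first == 0:
            # Special case for 0b00000000: a full 64-bit payload follows.
            return int.from_bytes(buf[1:9], "little")
        nbytes = ((first & -first).bit_length() - 1) + 1   # trailing zeros + 1
        raw = int.from_bytes(buf[:nbytes], "little")
        return raw >> nbytes                               # drop the unary length bits

    # 1-byte field: payload 0x3A encoded as (0x3A << 1) | 1
    assert decode_length_prefixed(bytes([(0x3A << 1) | 1])) == 0x3A
    # 2-byte field: payload 0x1234 encoded as (0x1234 << 2) | 0b10, little endian
    assert decode_length_prefixed(((0x1234 << 2) | 0b10).to_bytes(2, "little")) == 0x1234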
That's assuming the text is not corrupted or maliciously modified. There were (are) _numerous_ vulnerabilities due to parsing/escaping of invalid UTF-8 sequences.
Quick googling (not all of them are on-topic though):
This tendency toward requirement overloading, for what can otherwise be a simple solution to a simple problem, is the bane of engineering. In this case, if security is important, it can be addressed separately, e.g. by treating the underlying text as an abstract block of information that is packaged with corresponding error codes and checked for integrity before consumption. The UTF-8 encoding/decoding process itself doesn't necessarily have to answer the security concerns. Please let the solutions be simple, whenever they can be.
and also use whatever bits are left over after encoding the length (which could be in 8-bit blocks, so you write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic
together with other methods that let you compress a text + a full text index for the text into less room than text and not even have to use a stopword list. As you say, UTF-8 does something similar in spirit but ASCII compatible and capable of fast synchronization if data is corrupted or truncated.
This is referred to as UTF-8 being "self-synchronizing". You can jump to the middle and find a codepoint boundary. You can read it backwards. You can read it forwards.
also, the redundancy means that you get a pretty good heuristic for "is this utf-8". Random data or other encodings are pretty unlikely to also be valid utf-8, at least for non-tiny strings
This isn't quite right. In invalid UTF8, a continuation byte can also emit a replacement char if it's the start of the byte sequence. Eg, `0b01100001 0b10000000 0b01100001` outputs 3 chars: a�a. Whether you're at the beginning of an output char depends on the last 1-3 bytes.
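Python's decoder behaves exactly that way when told to substitute instead of raise:

    >>> b"\x61\x80\x61".decode("utf-8", errors="replace")
    'a�a'
    >>> # with the default errors="strict", the same bytes raise UnicodeDecodeError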
Wouldn't you only need to read backwards at most 3 bytes to see if you were currently at a continuation byte? With a max multi-byte size of 4 bytes, if you don't see a multi-byte start character by then you would know it's a single-byte char.
I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you naively slice an array of UTF-8 bytes, a UTF-8 aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.
> Having the continuation bytes always start with the bits `10` also makes it possible to seek to any random byte and trivially know whether you're at the beginning of a character or at a continuation byte, like you mentioned, so you can easily find the beginning of the next or previous character.
Given the four-byte maximum, it's a similarly trivial algorithm for the other case you mention.
The main difference I see is that UTF-8 increases the chance of catching and flagging an error in the stream. E.g., any non-ASCII byte that is missing from the stream is highly likely to cause an invalid sequence. Whereas with the other case you mention, the continuation bytes would cause silent errors (since an ASCII character would be indistinguishable from continuation bytes).
What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.
UTF-8 is indeed a genius design. But of course it’s crucially dependent on the decision for ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.
Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.
I don't know if this is the reason or if the causality goes the other way, but: it's worth noting that we didn't always have 8 general purpose bits. 7 bits + 1 parity bit or flag bit or something else was really common (enough so that e-mail to this day still uses quoted-printable [1] to encode octets with 7-bit bytes). A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and wasn't always a given.
In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...
Not an expert but I happened to read about some of the history of this a while back.
ASCII has its roots in teletype codes, which were a development from telegraph codes like Morse.
Morse code is variable length, so this made automatic telegraph machines or teletypes awkward to implement. The solution was the 5 bit Baudot code. Using a fixed length code simplified the devices. Operators could type Baudot code using one hand on a 5 key keyboard. Part of the code's design was to minimize operator fatigue.
Baudot code is why we refer to the symbol rate of modems and the like in Baud btw.
Anyhow, the next change came when, instead of telegraph machines directly signaling on the wire, a typewriter was used to create a punched tape of codepoints, which would be loaded into the telegraph machine for transmission. Since the keyboard was now decoupled from the wire code, there was more flexibility to add additional code points. This is where stuff like "Carriage Return" and "Line Feed" originate. This got standardized by Western Union and internationally.
By the time we get to ASCII, teleprinters are common, and the early computer industry adopted punched cards pervasively as an input format. And they initially did the straightforward thing of just using the telegraph codes. But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.
So zooming out here the story is that we started with binary codes, then adopted new schemes as technology developed. All this happened long before the digital computing world settled on 8 bit bytes as a convention. ASCII as bytes is just a practical compromise between the older teletype codes and the newer convention.
> But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.
Technically, the punch card processing technology was patented by inventor Herman Hollerith in 1884, and the company he founded wouldn't become IBM until 40 years later (though it was folded with 3 other companies into the Computing-Tabulating-Recording company in 1911, which would then become IBM in 1924).
To be honest though, I'm not clear how ASCII came from anything used by the punch card sorting machines, since it wasn't proposed until 1961 (by an IBM engineer, but 32 years after Hollerith's death). Do you know where I can read more about the progression here?
Fun fact: ASCII was a variable length encoding. No really! It was designed so that one could use overstrike to implement accents and umlauts, and also underline (which still works like that in terminals). I.e., á would be written a BS ' (or ' BS a), à would be written as a BS ` (or ` BS a), ö would be written o BS ", ø would be written as o BS /, ¢ would be written as c BS |, and so on and on. The typefaces were designed to make this possible.
This lives on in compose key sequences, so instead of a BS ' one types compose-' a and so on.
And this all predates ASCII: it's how people did accents and such on typewriters.
This is also why Spanish used to not use accents on capitals, and still allows capitals to not have accents: that would require smaller capitals, but typewriters back then didn't have them.
The use of 8-bit extensions of ASCII (like the ISO 8859-x family) was ubiquitous for a few decades, and arguably still is to some extent on Windows (the standard Windows code pages). If ASCII had been 8-bit from the start, but with the most common characters all within the first 128 integers, which would seem likely as a design, then UTF-8 would still have worked out pretty well.
The accident of history is less that ASCII happens to be 7 bits, but that the relevant phase of computer development happened to primarily occur in an English-speaking country, and that English text happens to be well representable with 7-bit units.
Most languages are well representable with 128 characters (7-bits) if you do not include English characters among those (eg. replace those 52 characters and some control/punctuation/symbols).
This is easily proven by the success of all the ISO-8859-*, Windows and IBM CP-* encodings, and all the *SCII (ISCII, YUSCII...) extensions — they fit one or more languages in the upper 128 characters.
It's mostly CJK out of large languages that fail to fit within 128 characters as a whole (though there are smaller languages too).
Many of the extended characters in ISO 8859-* can be implemented using pure ASCII with overstriking. ASCII was designed to support overstriking for this purpose. Overstriking was how one typed many of those characters on typewriters.
Historical luck. Though "luck" is probably pushing it in the way one might say certain math proofs are historically "lucky" based on previous work. It's more an almost natural consequence.
Before ASCII there was BCDIC, which was six bits and non-standardized (there were variants, just like technically there are a number of ASCII variants, with the common one just referred to as ASCII these days).
BCDIC was the capital English letters plus common punctuation plus numbers. 2^6 is 64; capital letters + numbers give you 36, and a few common punctuation marks put you around 50. IIRC the original by IBM was around 45 or something. Slash, period, comma, etc.
So when there was a decision to support lowercase, they added a bit because that's all that was necessary, and I think the printers around at the time couldn't print anything but something less than 128 characters anyway. There wasn't any ó or ö or anything printable, so why support it?
But eventually that yielded to 8-bit encodings (various extended ASCIIs like Latin-1, etc., that had ñ and so on).
Crucially, UTF-8 is only compatible with the 7-bit ASCII. All those 8-bit ASCIIs are incompatible with UTF-8 because they use the eighth bit.
7 bits isn't that odd. Baudot was 5 bits, and found insufficient, so 6-bit codes were developed; they were found insufficient, so 7-bit ASCII was developed.
IBM had standardized 8-bit bytes on their System/360, so they developed the 8-bit EBCDIC encoding. Other computing vendors didn't have consistent byte lengths... 7-bits was weird, but characters didn't necessarily fit nicely into system words anyway.
I don't really say this to disagree with you, but I feel weird about the phrasing "found insufficient", as if we reevaluated and said 'oops'.
It's not like 5-bit codes forgot about numbers and 80% of punctuation, or like 6-bit codes forgot about having upper and lower case letters. They were clearly 'insufficient' for general text even as the tradeoff was being made, it's just that each bit cost so much we did it anyway.
The obvious baseline by the time we were putting text into computers was to match a typewriter. That was easy to see coming. And the symbols on a typewriter take 7 bits to encode.
Crucially, "the 7-bit coded character set" is described on page 6 using only seven total bits (1-indexed, so don't get confused when you see b7 in the chart!).
There is an encoding mechanism to use 8 bits, but it's for storage on a type of magnetic tape, and even that still is silent on the 8th bit being repurposed. It's likely, given the lack of discussion about it, that it was for ergonomic or technical purposes related to the medium (8 is a power of 2) rather than for future extensibility.
When ASCII was invented, 36-bit computers were popular, which would fit five ASCII characters with just one unused bit per 36-bit word. Before, 6-bit character codes were used, where a 36-bit word could fit six of them.
I'm not sure, but it does seem like a great bit of historical foresight. It stands as a lesson to anyone standardizing something: wanna use a 32 bit integer? Make it 31 bits. Just in case. Obviously, this isn't always applicable (e.g. sizes, etc..), but the idea of leaving even the smallest amount of space for future extensibility is crucial.
UTF-8 is as good as a design as could be expected, but Unicode has scope creep issues. What should be in Unicode?
Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because
* It's not discrete. Some code points are for combining with other code points.
* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs which (almost?) always display the same, have different code points and meanings.
* It's not all printable. Control characters are in there - they pretty much had to be due to compatibility with ASCII, but they've added plenty of their own.
I'm not aware of any Unicode code points that are animated - at least what's printable is printable on paper and not just on screen; there are no marquee or blink control characters, thank God. But who knows when that invariant will fall too.
By the way, I know of one UTF encoding the author didn't mention: UTF-7. Like UTF-8, but assuming that the eighth (high) bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in UTF-7 once; that's how I know what it is. I don't know how he managed to send it, though.
the fact that there is seemingly no interest in fixing this, and if you want chinese and japanese in the same document, you're just fucked, forever, is crazy to me.
They should add separate code points for each variant and at least make it possible to avoid the problem in new documents. I've heard the arguments against this before, but the longer you wait, the worse the problem gets.
UTF-7 is/was mostly for email, which is not an 8-bit clean transport. It is obsolete and can't encode supplemental planes (except via surrogate pairs, which were meant for UTF-16).
There is also UTF-9, from an April Fools RFC, meant for use on hosts with 36-bit words such as the PDP-10.
The problem is the solution here. Add obscure stuff to the standard, and not everything will support it well. We got something decent in the end, different languages' scripts will mostly show up well on all sorts of computers. Apple's stuff like every possible combination of skin tone and gender family emoji might not.
Unicode wanted ability to losslessly roundtrip every other encoding, in order to be easy to partially adopt in a world where other encodings were still in use. It merged a bunch of different incomplete encodings that used competing approaches. That's why there are multiple ways of encoding the same characters, and there's no overall consistency to it. It's hard to say whether that was a mistake. This level of interoperability may have been necessary for Unicode to actually win, and not be another episode of https://xkcd.com/927
Why did Unicode want codepointwise round-tripping? One codepoint in a legacy encoding becoming two in Unicode doesn't seem like it should have been a problem. In other words, why include precomposed characters in Unicode?
> * It's not discrete. Some code points are for combining with other code points.
This isn't "scope creep". It's a reflection of reality. People were already constructing compositions like this is real life. The normalization problem was unavoidable.
One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these; only the shortest one is valid. E.g. 00000001 is the same as 11000000 10000001.
So why not make the alternatives impossible by offsetting each longer form so it starts just past the last valid shorter option? Then 11000000 10000001 would give codepoint 128+1, as values 0 to 127 are already covered by a 1-byte sequence.
The advantages are clear: No illegal codes, and a slightly shorter string for edge cases. I presume the designers thought about this, so what were the disadvantages? The required addition being an unacceptable hardware cost at the time?
UPDATE: Last bitsequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.
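For what it's worth, existing strict decoders already reject the overlong form; a quick CPython check:

    b"\x01".decode("utf-8")        # U+0001: the shortest form decodes fine
    b"\xc0\x81".decode("utf-8")    # overlong two-byte form of U+0001: raises UnicodeDecodeError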
The siblings so far talk about the synchronizing nature of the indicators, but that's not relevant to your question. Your question is more of
Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?
I suspect the answer is
a) the security impacts of overlong encodings were not contemplated; lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings
b) utf-8 as standardized allows for encode and decode with bitmask and bitshift only. Your proposed encoding requires bitmask and bitshift plus addition and subtraction
You can find a bit of email discussion from 1992 here [1] ... at the very bottom there's some notes about what became utf-8:
> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7
> are allowed. The codes in the range 0-7f are illegal.
> I think this is preferable to a pile of magic additive
> constants for no real benefit. Similar comment applies
> to all of the longer sequences.
The included FSS-UTF that's right before the note does include additive constants.
Oops yeah. One of my bit sequences is of course wrong and seems to have derailed this discussion. Sorry for that. Your interpretation is correct.
I've seen the first part of that mail, but your version is a lot longer. It is indeed quite convincing in declaring b) moot. And security was not as big of a thing then as it is now, so you're probably right.
A variation of a) is comparing strings as UTF-8 byte sequences if overlong encodings are also accepted (before and/or later). This leads to situations where strings tested as unequal are actually equal in terms of code points.
See quectophoton's comment—the requirement that continuation bytes are always tagged with a leading 10 is useful if a parser is jumping in at a random offset—or, more commonly, if the text stream gets fragmented. This was actually a major concern when UTF-8 was devised in the early 90s, as transmission was much less reliable than it is today.
It also notes that UTF-8 protects against the dangers of NUL and '/' appearing in filenames, which would kill C strings and DOS path handling, respectively.
I assume you mean "11000000 10000001" to preserve the property that all continuation bytes start with "10"? [Edit: looks like you edited that in]. Without that property, UTF-8 loses self-synchronicity, the property that given a truncated UTF-8 stream, you can always find the codepoint boundaries, and will lose at most codepoint worth rather than having the whole stream be garbled.
In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you declared that you had to subtract the legal codepoints represented by shorter sequences, you'd have to introduce additional arithmetic operations in encoding and decoding.
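To make that concrete, the standardized two-byte decode really is just masks and shifts; a small Python sketch (the trailing comment shows the extra addition the proposed scheme would need):

    def decode_two_bytes(b1, b2):
        # UTF-8 as standardized: masks and shifts only, no additive constants.
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

    print(hex(decode_two_bytes(0xC2, 0x80)))         # 0x80 -> U+0080
    # The proposed offset scheme would instead need:
    # decode_two_bytes(0xC0, 0x81) + 0x80            # -> 129, i.e. the "128+1" above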
That would make the calculations more complicated and a little slower. Now you can do a few quick bit shifts. This was more of an issue back in the '90s when UTF-8 was designed and computers were slower.
Because then it would be impossible to tell from looking at a byte whether it is the beginning of a character or not, which is a useful property of UTF-8.
I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code units up to 2^21-1.
I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger UTF-8 code units.
That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8 that made the sacrifice: Unicode is limited to 21 bits due to UTF-16. The UTF-8 design trivially extends to support 6 byte long sequences supporting up to 31-bit numbers. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?
> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code units up to 2^21-1
Yes, it is 'truncated' to the "UTF-16 accessible range":
It's always dangerous to stick one's neck out and say "[this many bits] ought to be enough for anybody", but I think it's very unlikely we'll ever run out of UTF-8 sequences. UTF-8 can represent about 1.1 million code points, of which we've assigned about 160,000 actual characters, plus another ~140,000 in the Private Use Area, which won't expand. And that's after covering nearly all of the world's known writing systems: the last several Unicode updates have added a few thousand characters here and there for very obscure and/or ancient writing systems, but those won't go on forever (and things like emoji only get a handful of new code points per update, because most new emoji are existing code points with combining characters).
If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.
> It sacrifices the ability to encode more than 21 bits
No, UTF-8's design can encode up to 31 bits of codepoints. The limitation to 21 bits comes from UTF-16, which was then adopted for UTF-8 too. When UTF-16 dies we'll be able to extend UTF-8 (well, compatibility will be a problem).
That limitation will be trivial to lift once UTF-16 compatibility can be disregarded. This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or thousand years. Until then, we still have a lot of unassigned code points.
In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.
> I love when an entity in a position of power is willing to break things in the name of advancement.
It's less fun when you have things that need to keep working break because someone felt like renaming a parameter, or that a part of the standard library looks "untidy"
> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
There were apps that completely rejected non-7-bit data back in the day. Backwards compatibility wasn't the only point. The point of UTF-8 is more (IMO) that UTF-32 is too bulky, UCS-2 was insufficient, UTF-16 was an abortion, and only UTF-8 could have the right trade-offs.
Yeah I honestly don't know what I would change. Maybe replace some of the control characters with more common characters to save a tiny bit of space, if we were to go completely wild and break Unicode backward compatibility too. As a generic multi byte character encoding format, it seems completely optimal even in isolation.
Read that a few times back then as well, but that and other pieces of the day never told you how to actually write a program that supported Unicode. Just facts about it.
So I went around fixing UnicodeErrors in Python at random, for years, despite knowing all that stuff. It wasn't until I read Batchelder's piece on the "Unicode Sandwich," about a decade later that I finally learned how to write a program to support it properly, rather than playing whack-a-mole.
UTF-16 made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody" and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.
As Unicode (quickly) evolved, it turned out that not only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.
So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.
UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.
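A quick Python illustration of how code points stop lining up with what a reader would call the 4th "character":

    s = "cafe\u0301"            # 'café' spelled with a combining acute accent
    print(len(s))                # 5 code points for 4 user-perceived characters
    print(repr(s[4]))            # just the combining accent, not a "character"
    print("café"[3] == s[3])     # False: precomposed 'é' vs plain 'e'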
I think more effort should have been made to live with 65,536 characters. My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to Unicode is going to be seen as a big mistake. We already have enough network bandwidth to just send raster graphics for images in most cases. Cluttering the Unicode codespace with emojis is pointless.
Yeah, Java and Windows NT3.1 had really bad timing. Both managed to include Unicode despite starting development before the Unicode 1.0 release, but both added unicode back when Unicode was 16 bit and the need for something like UTF-8 was less clear
NeXTstep was also UTF-16 through OpenStep 4.0, IIRC. Apple was later able to fix this because the string abstraction in the standard library was complete enough no one actually needed to care about the internal representation, but the API still retains some of the UTF-16-specific weirdnesses.
> It was so easy once we saw it that there was no reason to keep the placemat for notes, and we left it behind. Or maybe we did bring it back to the lab; I'm not sure. But it's gone now.
UTF-8 is great and I wish everything used it (looking at you JavaScript). But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence even if nominally "invalid". This is how the HTML5 spec works and it's been phenomenally successful.
For security reasons, the correct answer on how to process invalid UTF-8 is (and needs to be) "throw away the data like it's radioactive, and return an error." Otherwise you leave yourself wide open to validation bypass attacks at many layers of your stack.
This is rarely the correct thing to do. Users don't particularly like it if you refuse to process a document because it has an error somewhere in there.
Even for identifiers you probably want to do all kinds of normalization even beyond the level of UTF-8 so things like overlong sequences and other errors are really not an inherent security issue.
> This is how the HTML5 spec works and it's been phenomenally successful.
Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences by replacing them with the U+FFFD ("replacement character"). You'll see it used (for example) in browsers all the time.
Mandating acceptance for every invalid input works well for HTML because it's meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors instead of making an automatic correction that can't be automatically detected after the fact.
> But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined.
This is not a wart. And how to interpret them is not undefined -- you're just not allowed to interpret them as _characters_.
There is right now a discussion[0] about adding a garbage-in/garbage-out mode to jq/jaq/etc that allows them to read and output JSON with invalid UTF-8 strings representing binary data in a way that round-trips. I'm not for making that the default for jq, and you have to be very careful about this to make sure that all the tools you use to handle such "JSON" round-trip the data. But the clever thing is that the proposed changes indeed do not interpret invalid byte sequences as character data, so they stay within the bounds of Unicode as long as your terminal (if these binary strings end up there) and other tools also do the same.
I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).
UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!
I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.
I worked on a site in the late 90s which had news in several Asian languages, including both simplified and traditional Chinese. We had a partner in Hong Kong sending articles and being a stereotypical monolingual American I took them at their word that they were sending us simplified Chinese and had it loaded into our PHP app which dutifully served it with that encoding. It was clearly Chinese so I figured we had that feed working.
A couple of days later, I got an email from someone explaining that it was gibberish — apparently our content partner who claimed to be sending GB2312 simplified Chinese was in fact sending us Big5 traditional Chinese so while many of the byte values mapped to valid characters it was nonsensical.
If you want to delve deeper into this topic and like the Advent of Code format, you're in luck: i18n-puzzles[1] has a bunch of puzzles related to text encoding that drill how UTF-8 (and other variants such as UTF-16) work into your brain.
Meanwhile Shift-JIS has a bad design, since the second byte of a character can be an ASCII character in the range 0x40-0x7E. This includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde. This can cause a path separator or math operator to appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.
UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
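The classic example is 表 (U+8868), whose second Shift-JIS byte is 0x5C, the ASCII backslash; a quick Python check:

    print("表".encode("shift_jis"))   # b'\x95\\' -- the trailing byte is an ASCII '\'
    print("表".encode("utf-8"))       # b'\xe8\xa1\xa8' -- every byte has the high bit set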
I need to call out a myth about UTF-8. Tools built to assume UTF-8 are not backwards compatible with ASCII. An encoding INCLUDES but also EXCLUDES. When a tool is set to use UTF-8, it will process an ASCII stream, but it will not filter out non-ASCII.
I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
The usual statement isn't that UTF-8 is backwards compatible with ASCII (it's obvious that any 8-bit encoding wouldn't be; that's why we have UTF-7!). It's that UTF-8 is backwards compatible with tools that are 8-bit clean.
Yes, the myth I was pointing out is based on loose terminology. It needs to be made clear that "backwards compatible" means that UTF-8 based tools can receive but are not constrained to emit valid ASCII. I see a lot of comments implying that UTF-8 can interact with an ASCII ecosystem without causing problems. Even worse, it seems most Linux developers believe there is no longer a need to provide a default ASCII setting if they have UTF-8.
While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.
In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.
UTF-8 also enables this mindblowing design for small string optimization - if the string has 24 bytes or less it is stored inline, otherwise it is stored on the heap (with a pointer, a length, and a capacity - also 24 bytes)
Karpathy's "Let's build the GPT Tokenizer" also contains a good introduction to Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32 in the first 20 minutes: https://www.youtube.com/watch?v=zduSFxRajkE
It's worth noting that Stallman had earlier proposed a design for Emacs "to handle all the world's alphabets and word signs" with similar requirements to UTF-8. That was the etc/CHARACTERS file in Emacs 18.59 (1990). The eventual international support implemented in Emacs 20's MULE was based on ISO-2022, which was a reasonable choice at the time, based on earlier Japanese work. (There was actually enough space in the MULE encoding to add UTF-8, but the implementation was always going to be inefficient with the number of bytes at the top of the code space.)
A little off topic but amidst a lot of discussion of UTF-8 and its ASCII compatibility property I'm going to mention my one gripe with ASCII, something I never see anyone talking about, something I've never talked about before:
The damn 0x7f character. Such an annoying anomaly in every conceivable way. It would be much better if it was some other proper printable punctuation or punctuation adjacent character. A copyright character. Or a pi character or just about anything other than what it already is. I have been programming and studying packet dumps long enough that I can basically convert hex to ASCII and vice versa in my head but I still recoil at this anomalous character (DELETE? is that what I should call it?) every time.
Much better in every way except the one that mattered most: being able to correct punching errors in a paper tape without starting over.
I don't know if you have ever had to use White-Out to correct typing errors on a typewriter that lacked the ability natively, but before White-Out, the only option was to start typing the letter again, from the beginning.
0x7f was White-Out for punched paper tape: it allowed you to strike out an incorrectly punched character so that the message, when it was sent, would print correctly. ASCII inherited it from the Baudot–Murray code.
It's been obsolete since people started punching their tapes on computers instead of Teletypes and Flexowriters, so around 01975, and maybe before; I don't know if there was a paper-tape equivalent of a duplicating keypunch, but that would seem to eliminate the need for the delete character. Certainly TECO and cheap microcomputers did.
Related: Why is there a “small house” in IBM's Code page 437? (glyphdrawing.club) [1]. There are other interesting articles mentioned in the discussion. m_walden probably would comment here himself
I once saw a good byte encoding for Unicode: 7 bit for data, 1 for continuation/stop. This gives 21 bit for data, which is enough for the whole range. ASCII compatible, at most 3 bytes per character. Very simple: the description is sufficient to implement it.
Probably a good idea, but when UTF-8 was designed the Unicode committee had not yet made the mistake of limiting the character range to 21 bits. (Going into why it's a mistake would make this comment longer than it's worth, so I'll only expound on it if anyone asks me to). And at this point it would be a bad idea to switch away from the format that is now, finally, used in over 99% of all documents online. The gain would be small (not zero, but small) and the cost would be immense.
It took time for UTF-8 to make sense. Struggling with how large everything was was a real problem just after the turn of the century. Today it makes more sense because capacity and compute power is much greater but back then it was a huge pain in the ass.
It made much more sense than UTF-16 or any of the existing multi-byte character sets, and the need for more than 256 characters had been apparent for decades. Seeing its simplicity, it made perfect sense almost immediately.
No, it didn't. Not at the time. Like I said, processing and storage were a pain back around 2000. Windows supported UCS-2 (the predecessor to UTF-16), which was fixed-width 16-bit and faster to encode and decode, and since most of the world was Windows at the time, it made more sense to use UCS-2. Also, the world was only beginning to be more connected, so UTF-8 seemed overkill.
NOW in hindsight it makes more sense to use UTF-8 but it wasn't clear back 20 years ago it was worth it.
Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode UTF-8, whereas the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.
Yeah, protobuf's varint are quite hard to decode with current SIMD instructions, but it would be quite easy, if we get element wise pext/pdep instructions in the future. (SVE2 already has those, but who has SVE2?)
I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?
UTF-8 can represent up to 1,114,112 characters in Unicode. And in Unicode 15.1 (2023, https://www.unicode.org/versions/Unicode15.1.0/) a total of 149,813 characters are included, which covers most of the world's languages, scripts, and emojis. That leaves a 960K space for future expansion.
Wait until we get to know another species; then we will not just fill that Unicode space, but we will ditch any UTF-16 compatibility so fast it will make your head spin on a swivel.
Imagine the code points we'll need to represent an alien culture :).
If we ever needed that many characters, yes the most obvious solution would be a fifth byte. The standard would need to be explicitly extended though.
But that would probably require having encountered literate extraterrestrial species to collect enough new alphabets to fill up all the available code points first. So seems like it would be a pretty cool problem to have.
UTF-8 is just an encoding of Unicode. UTF-8 is specified in a way so that it can encode all Unicode codepoints up to 0x10FFFF. It doesn't extend further. And UTF-16 also encodes Unicode in a similar way; it doesn't encode anything more.
So what would need to happen first would be that unicode decides they are going to include larger codepoints. Then UTF-8 would need to be extended to handle encoding them. (But I don't think that will happen.)
It seems like Unicode codepoints are less than 30% allocated, roughly. So there's 70% free space..
---
Think of these three separate concepts to make it clear. We are effectively dealing with two translations - one from the abstract symbol to defined unicode code point. Then from that code point we use UTF-8 to encode it into bytes.
1. The glyph or symbol ("A")
2. The unicode code point for the symbol (U+0041 Latin Capital Letter A)
3. The utf-8 encoding of the code point, as bytes (0x41)
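The same three layers, checked quickly in Python with a non-ASCII example:

    ch = "€"                                  # 1. the glyph
    print(hex(ord(ch)))                       # 2. the code point: 0x20ac (U+20AC)
    print(ch.encode("utf-8"))                 # 3. the UTF-8 bytes: b'\xe2\x82\xac'
    print(b"\xe2\x82\xac".decode("utf-8"))    # and back to '€'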
As an aside: UTF-8, as originally specified in RFC 2279, could encode codepoints up to U+7FFFFFFF (using sequences of up to six bytes). It was later restricted to U+10FFFF to ensure compatibility with UTF-16.
I take it you could choose to encode a code point using a larger number of bytes than are actually needed? E.g., you could encode "A" using 1, 2, 3 or 4 bytes?
Because if so: I don't really like that. It would mean that "equal sequence of code points" does not imply "equal sequence of encoded bytes" (the converse continues to hold, of course), while offering no advantage that I can see.
UTF-8 is undeniably a good answer, but to a relatively simple bit-twiddling / variable-length integer encoding problem in a somewhat specific context.
I realize that hindsight is 20/20, and times were different, but let's face it: "how to use an unused top bit to best encode larger numbers representing Unicode" is not that much of a challenge, and the space of practical solutions isn't even all that large.
Except that there were many different solutions before UTF-8, all of which sucked really badly.
UTF-8 is the best kind of brilliant. After you've seen it, you (and I) think of it as obvious, and clearly the solution any reasonable engineer would come up with. Except that it took a long time for it to be created.
I just realised that all Latin text is wasting 12.5% of storage/memory/bandwidth with the MSB at zero. At least it compresses well. Is there any technology that utilizes the 8th bit for something useful, e.g. error checking?
See mort96's comments about 7-bit ASCII and parity bits (https://news.ycombinator.com/item?id=45225911). Kind of archaic now, though - 8-bit bytes with the error checking living elsewhere in the stack seems to be preferred.
One aspect of Unicode that is probably not obvious is that with Unicode it is possible to keep using old encodings just fine. You can always get their Unicode equivalents, this is what Unicode was about. Otherwise just keep the data as is, tagged with the encoding. This nicely extends to filesystem "encodings" too.
For example, modern Python internally uses three forms (Latin-1, UTF-16 and 32) depending on the contents of the string. But this can be done for all encodings and also for things like file names that do not follow Unicode. The Unicode standard does not dictate everything must take the same form; it can be used to keep existing forms but make them compatible.
UTF-8 is a nice extension of ASCII from the compatibility point of view, but it might not be the most compact, especially if the text is not English-like. Also, the variable character length makes it inconvenient to work with strings unless they are parsed/saved into/from a 2- or 4-byte char array.
Nice article, thank you. I love UTF-8, but I only advocate it when used with a BOM. Otherwise, an application may have no way of knowing that it is UTF-8, and that it needs to be saved as UTF-8.
Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters)
The same can happen to a plain ASCII file (without a BOM): once you edit it, and you add, say, some accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.
So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.
The fact that you advocate using a BOM with UTF-8 tells me that you run Windows. Any long-term Unix user has probably seen this error message before (copy and pasted from an issue report I filed just 3 days ago):
bash: line 1: #!/bin/bash: No such file or directory
If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of that shell script starting with the `#!` character pair that tells the Linux kernel "The application after the `#!` is the application that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash did end up running the script, with only one error message (because the line did not start with `#` and therefore it was not interpreted as a Bash comment) that didn't matter to the actual script logic.
This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file that contains only codepoints below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that look for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).
What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that make another encoding sensible, reparsing it as something else (like Windows-1252) might be sensible.
But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.
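This is also why Python ships a separate "utf-8-sig" codec: the BOM is not part of plain UTF-8, it's an optional signature that has to be stripped explicitly. A small sketch:

    data = b"\xef\xbb\xbf#!/bin/bash\n"        # a BOM in front of an otherwise ASCII script
    print(repr(data.decode("utf-8")[:3]))      # '\ufeff#!' -- the BOM survives as U+FEFF
    print(repr(data.decode("utf-8-sig")[:2]))  # '#!' -- the -sig codec strips the signature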
I like your answer, and the others too, but I suspect I have an even worse problem than running Windows: I am an Amiga user :D
The Amiga always used all 8 bits (ISO-8859-1 by default), so detecting UTF-8 without a BOM is not so easy, especially when you start with an empty file, or in some scenario like the other one I mentioned.
And it's not that Macs and PCs don't have 8-bit legacy or coexistence needs. What you seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important.
Since we now have UTF-8 files with BOMs that need to be handled anyway, would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?
Also some XML parsers I used choked on UTF-8 BOMs. Not sure if valid XML is allowed to have anything other than clean ASCII in the first few characters before declaring what the encoding is?
I respectfully disagree. The BOM is a Windows-specific idiosyncrasy resulting from its early adoption of UTF-16. In the Unix world, a BOM is unexpected and causes problems with many programs, such as GCC, PHP and XML parsers. Don't use it!
The correct approach is to use and assume UTF-8 everywhere. 99% of websites use UTF-8. There is no reason to break software by adding a BOM.
You do not need a BOM for UTF-8. Ever. Byte order issues are not a problem for UTF-8 because UTF-8 is manipulated as a string of _bytes_, not as a string of 16-bit or 32-bit code units.
In a pure UTF-8 world we would not need it, sure. I get that point. But what do you want to do with 40+ years worth of text files that came after 7-bit ASCII, where they may coexist with UTF-8? If we want to preserve our past, the practical solution is that the OS or app has a default character set for 8-bit text encoding, in addition to supporting (and using as a default) UTF-8.
I also agree that "BOM" is the wrong name for a UTF-8... BOM. Byte order is not the issue. But still, it's a header that says that the file, even if empty, is UTF-8. Detecting an 8-bit legacy character set is much more difficult than recognizing (skipping) a BOM.
I made an interactive one since I couldn't find anything that allows individually set/unset bits and see what happens. Here: https://utf8-playground.netlify.app/
UTF-8 is a neat way of encoding 1M+ code points in 8 bit bytes, and including 7 bit ASCII. If only unicode were as neat - sigh. I guess it's way too late to flip unicode versions and start again avoiding the weirdness.
The story is that Ken and Rob were at a diner when Ken gave structure to it and wrote the initial encode/decode functions on napkins. UTF-8 is so simple yet it required a complex mind to do it.
Love reading explorations of structures and technical phenomena that are basically the digital equivalent of oxygen in their ubiquity and in how we take them for granted
UTF-8 contributors are some of our modern day unsung heroes. The design is brilliant but the dedication to encode every single way humans communicate via text into a single standard, and succeed at it, is truly on another level.
Most other standards just do the xkcd thing: "now there's 15 competing standards"
UTF-8 was a huge improvement for sure, but 20-25 years ago I was working with LATIN-1 (so 8-bit characters), which was a struggle in the years it took for everything to switch to UTF-8. The compatibility with ASCII meant you only really noticed something was wrong when the data had special characters not representable in ASCII but valid in LATIN-1. So perhaps breaking backwards compatibility would've resulted in less data corruption overall.
Because the original design assumed that 16 bits are enough to encode everything worth encoding, hence UCS2 (not UTF-16, yet) being the easiest and most straightforward way to represent things.
No. UTF-8 is for encoding text, so we don't need to care about it being variable length because text was already variable length.
The network addresses aren't variable length, so if you decide "Oh IPv6 is variable length" then you're just making it worse with no meaningful benefit.
The IPv4 address is 32 bits, the IPv6 address is 128 bits. You could go 64 but it's much less clear how to efficiently partition this and not regret whatever choices you do make in the foreseeable future. The extra space meant IPv6 didn't ever have those regrets.
It suits a certain kind of person to always pay $10M to avoid the one-time $50M upgrade cost. They can do this over a dozen jobs in twenty years, spending $200M to avoid $50M cost and be proud of saving money.
You reserve 32 bits of these 128 for backward compatibility, just like UTF-8 did for ASCII, and request a backward-compatible fallback from user interfaces. I hope that clears it up.
Well, you have to click around a bit and be prepared to look at the other pages in Pabel's series of posts … I linked to this one since I felt it chimes well with the OP.
That's a problem with programming languages having inconsistent definitions of length. They could be like Swift where the programmer has control over what counts as length one. Or they could decide that the problem shouldn't be solved by the language but by libraries like ICU.
> Another one is the ISO/IEC 8859 encodings are single-byte encodings that extend ASCII to include additional characters, but they are limited to 256 characters.
ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed script text streams.
I specialize in protocol design, unfortunately. A while ago I had to code some Unicode conversion routines from scratch and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, likely because of objective reasons. Dealing with multiple Unicode encodings is a minefield. I even made an angry write-up back then https://web.archive.org/web/20231001011301/http://replicated...
UTF-8 made it all relatively neat back in the day.
There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.
> For example, how do you handle UTF-8 encoded surrogate pairs?
Surrogate pairs aren’t applicable to UTF-8. That block of Unicode code points is simply invalid in UTF-8 and should be treated as such (as a parsing error, invalid characters, etc.).
Maybe as to emojis, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.
UTF-16 is a hack that was invented when it became clear that UCS-2 wasn't gonna work (65536 codepoints was not enough for everybody).
Almost the entire world could have ignored it if not for Microsoft making the wrong choice with Windows NT and then stubbornly insisting that their wrong choice was indeed correct for a couple of decades.
There was a long phase where some parts of Windows understood (and maybe generated) UTF-16 and others only UCS-2.
Besides Microsoft, plenty of others thought UTF-16 to be a good idea. The Haskell Text type used to be based on UTF-16; it only switched to UTF-8 a few years ago. Java still uses UTF-16, but with an ad hoc optimization called CompactStrings to use ISO-8859-1 where possible.
UTF8 is a horrible design.
The only reason it was widely adopted was backwards compatibility with ASCII.
There are a large number of invalid byte combinations that have to be discarded.
Parsing forward is complex even before taking invalid byte combinations into account, and parsing backwards is even worse.
Compare that to UTF-16, where parsing forward and backwards are simpler than UTF-8, and if there is an invalid surrogate combination, one can assume it is a valid UCS-2 char.
UTF-16 is an abomination. It's only easy to parse because it's artificially limited to 1 or 2 code units. It's an ugly hack that requires reserving 2048 code points ("surrogates") from the Unicode table just for the encoding itself.
It's also the reason why Unicode has a limit of about 1.1 million code points: without UTF-16, we could have over 2 billion (which is the UTF-8 limit).
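To make the surrogate mechanism concrete, here is the arithmetic for one astral code point, sketched in Python and cross-checked against the utf-16-be codec:

```python
cp = 0x1F600 - 0x10000          # offset into the supplementary planes (20 bits)
hi = 0xD800 | (cp >> 10)        # high surrogate carries the top 10 bits
lo = 0xDC00 | (cp & 0x3FF)      # low surrogate carries the bottom 10 bits
print(hex(hi), hex(lo))                 # 0xd83d 0xde00
print("😀".encode("utf-16-be").hex())   # d83dde00
```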
Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.
If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.
[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4
Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.
Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.
I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.
PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both size and max code point are known, which is most cases where it comes from a literal or bytes.decode() call. Cut memory usage in typical Django applications by 2/3 when it was implemented.
https://peps.python.org/pep-0393/
I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use-cases involve chopping off a small prefix (eg. "hex_digits[2:]") or suffix (eg. "filename[-3:]"), and you can easily just linear search these with minimal CPU penalty. Or they're part of library methods where you want to have your own custom traversals, eg. .find(substr) can just do Boyer-Moore over bytes, .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.
This is Python; finding new ways to subscript into things directly is a graduate student’s favorite pastime!
In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.
Indices into a Unicode string is a highly unusual operation that is rarely needed. A string is Unicode because it is provided by the user or a localized user-facing string. You don't generally need indices.
Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.
Your solution is basically what Swift does. Plus they do the same with extended grapheme clusters (what a human would consider distinct characters mostly), and that’s the default character type instead of Unicode code point. Easily the best Unicode string support of any programming language.
Variable width encodings like UTF-8 and UTF-16 cannot be indexed in O(1), only in O(N). But this is not really a problem! Instead of indexing strings we need to slice them, and generally we read them forwards, so if slices (and slices of slices) are cheap, then you can parse textual data without a problem. Basically just keep the indices small and there's no problem.
> If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated.
What conversion rule do you want to use, though? You either reject some values outright, bump those up or down, or else start with a character index that requires an O(N) translation to a byte index.
"Unicode" aka "wide characters" is the dumbest engineering debacle of the century.
> ascii and codepage encodings are legacy, let's standardize on another forwards-incompatible standard that will be obsolete in five years
> oh, and we also need to upgrade all our infrastructure for this obsolete-by-design standard because we're now keeping it forever
VLQ/LEB128 are a bit better than the EBML's variable size integers. You test the MSB in the byte - `0` means it's the end of a sequence and the next byte is a new sequence. If the MSB is `1`, to find the start of the sequence you walk back until you find the first zero MSB at the end of the previous sequence (or the start of the stream). There are efficient SIMD-optimized implementations of this.
The difference between VLQ and LEB128 is endianness, basically whether the zero MSB is the start or end of a sequence.
It's not self-synchronizing like UTF-8, but it's more compact - any unicode codepoint can fit into 3 bytes (which can encode up to 0x1FFFFF), and ASCII characters remain 1 byte. Can grow to arbitrary sizes. It has a fixed overhead of 1/8, whereas UTF-8 only has overhead of 1/8 for ASCII and 1/3 thereafter. Could be useful compressing the size of code that uses non-ASCII, since most of the mathematical symbols/arrows are < U+3FFF. Also languages like Japanese, since Katakana and Hiragana are also < U+3FFF, and could be encoded in 2 bytes rather than 3.
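A rough Python sketch of the LEB128 flavor described above (little-endian, MSB set on all bytes except the final one); the function names here are just illustrative:

```python
def leb128_encode(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # clear MSB marks the end of the sequence
            return bytes(out)

def leb128_decode(buf: bytes) -> int:
    value = shift = 0
    for b in buf:
        value |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return value

print(leb128_encode(0x1F600).hex())                 # '80ec07' -> 3 bytes for an emoji codepoint
print(hex(leb128_decode(leb128_encode(0x1F600))))   # 0x1f600
```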
Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...
The full value is stored little endian, so you simply read the first byte (low byte) in the stream to get the full length, and it has the exact same compactness of VLQ/LEB128 (7 bits per byte).
Even better: modern chips have instructions that decode this field in one shot (callable via builtin):
https://github.com/kstenerud/ksbonjson/blob/main/library/src...
After running this builtin, you simply re-read the memory location for the specified number of bytes, then cast to a little-endian integer, then shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56 bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).
https://github.com/kstenerud/ksbonjson/blob/main/library/src...
It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.
https://github.com/kstenerud/bonjson
If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.
That's assuming the text is not corrupted or maliciously modified. There were (are) _numerous_ vulnerabilities due to parsing/escaping of invalid UTF-8 sequences.
Quick googling (not all of them are on-topic tho):
https://www.rapid7.com/blog/post/2025/02/13/cve-2025-1094-po...
https://www.cve.org/CVERecord/SearchResults?query=utf-8
This tendency toward requirement overloading, for what can otherwise be a simple solution to a simple problem, is the bane of engineering. In this case, if security is important, it can be addressed separately, e.g. by treating the underlying text as an abstract block of information that is packaged with corresponding error/integrity codes and checked before consumption. The UTF-8 encoding/decoding process itself doesn't necessarily have to answer the security concerns. Please let the solutions be simple, whenever they can be.
I was just wondering a similar thing: if `10` marks a continuation byte, doesn't that require `10` to never occur inside the other bits of a character?
It's not uncommon when you want variable length encodings to write the number of extension bytes used in unary encoding
https://en.wikipedia.org/wiki/Unary_numeral_system
and also use whatever bits are left over after encoding the length (which could be in 8-bit blocks, so you write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic
https://archive.org/details/managinggigabyte0000witt
together with other methods that let you compress a text plus a full-text index for that text into less room than the text alone, without even having to use a stopword list. As you say, UTF-8 does something similar in spirit, but ASCII compatible and capable of fast synchronization if data is corrupted or truncated.
This is referred to as UTF-8 being "self-synchronizing". You can jump to the middle and find a codepoint boundary. You can read it backwards. You can read it forwards.
also, the redundancy means that you get a pretty good heuristic for "is this utf-8". Random data or other encodings are pretty unlikely to also be valid utf-8, at least for non-tiny strings
This isn't quite right. In invalid UTF-8, a continuation byte can also emit a replacement char if it's at the start of a byte sequence. E.g., `0b01100001 0b10000000 0b01100001` outputs 3 chars: a�a. Whether you're at the beginning of an output char depends on the last 1-3 bytes.
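For what it's worth, you can reproduce that exact behavior with Python's 'replace' error handler (a quick check, nothing more):

```python
# 0x61, then a lone continuation byte, then 0x61: the stray 0x80 becomes U+FFFD.
data = bytes([0b01100001, 0b10000000, 0b01100001])
decoded = data.decode("utf-8", errors="replace")
print(decoded, len(decoded))  # a�a 3
```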
> outputs 3 chars
You mean codepoints or maybe grapheme clusters?
Anyways yeah it’s a little more complicated but the principle of being able to truncate a string without splitting a codepoint in O(1) is still useful
Wouldn't you only need to read backwards at most 3 bytes to see if you were currently at a continuation byte? With a max multi-byte size of 4 bytes, if you don't see a multi-byte start character by then you would know it's a single-byte char.
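A sketch of that backward scan, assuming valid UTF-8 (the function name is just illustrative); on valid input the loop steps back at most 3 bytes before hitting a lead byte:

```python
def codepoint_start(buf: bytes, i: int) -> int:
    # Continuation bytes match 10xxxxxx; step back until we leave them.
    while i > 0 and (buf[i] & 0b1100_0000) == 0b1000_0000:
        i -= 1
    return i

s = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
print(codepoint_start(s, 2))  # index 2 is the continuation byte of 'é' -> returns 1
```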
I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you naively slice an array of UTF-8 bytes, a UTF-8 aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.
It’s not always possible to read backwards.
> Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.
Given four byte maximum, it's a similarly trivial algo for the other case you mention.
The main difference I see is that UTF-8 increases the chance of catching and flagging an error in the stream. E.g., any non-ASCII byte that is missing from the stream is highly likely to cause an invalid sequence. Whereas with the other case you mention, the continuation bytes would cause silent errors (since an ASCII character would be indistinguishable from continuation bytes).
Encoding gurus-- am I right?
> so you can easily find the beginning of the next or previous character.
That is not always true [1]. While it is not a UTF-8 problem per se, it is a problem with how UTF-8 is being used.
[1] https://paulbutler.org/2025/smuggling-arbitrary-data-through...
Parent means “character” as defined here in Unicode: https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha..., effectively code points. Meanings 2 and 3 in the Unicode glossary here: https://www.unicode.org/glossary/#character
So you replace one costly sweep with another costly sweep. I wouldn't call that an advantage in any way over jumping n bytes.
What you describe is the bare minimum just so you even know what you are searching for, while you scan pretty much everything every time.
What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.
UTF-8 is indeed a genius design. But of course it’s crucially dependent on the decision for ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.
Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.
I don't know if this is the reason or if the causality goes the other way, but: it's worth noting that we didn't always have 8 general purpose bits. 7 bits + 1 parity bit or flag bit or something else was really common (enough so that e-mail to this day still uses quoted-printable [1] to encode octets with 7-bit bytes). A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and wasn't always a given.
In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...
[1] https://en.wikipedia.org/wiki/Quoted-printable
[2] https://en.wikipedia.org/wiki/8-bit_clean
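For the curious, the quoted-printable trick is easy to poke at with Python's standard quopri module (a small illustration, not tied to any particular mail setup):

```python
import quopri

encoded = quopri.encodestring("naïve café".encode("utf-8"))
print(encoded)                                        # non-ASCII octets become =XX escapes
print(quopri.decodestring(encoded).decode("utf-8"))   # naïve café
```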
"Five characters in a 36 bit word" was a fairly common trick on pre-byte architectures too.
Not an expert but I happened to read about some of the history of this a while back.
ASCII has its roots in teletype codes, which were a development from telegraph codes like Morse.
Morse code is variable length, so this made automatic telegraph machines or teletypes awkward to implement. The solution was the 5 bit Baudot code. Using a fixed length code simplified the devices. Operators could type Baudot code using one hand on a 5 key keyboard. Part of the code's design was to minimize operator fatigue.
Baudot code is why we refer to the symbol rate of modems and the like in Baud btw.
Anyhow, the next change came with instead of telegraph machines directly signaling on the wire, instead a typewriter was used to create a punched tape of codepoints, which would be loaded into the telegraph machine for transmission. Since the keyboard was now decoupled from the wire code, there was more flexibility to add additional code points. This is where stuff like "Carriage Return" and "Line Feed" originate. This got standardized by Western Union and internationally.
By the time we get to ASCII, teleprinters are common, and the early computer industry adopted punched cards pervasively as an input format. And they initially did the straightforward thing of just using the telegraph codes. But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.
So zooming out here the story is that we started with binary codes, then adopted new schemes as technology developed. All this happened long before the digital computing world settled on 8 bit bytes as a convention. ASCII as bytes is just a practical compromise between the older teletype codes and the newer convention.
> But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.
Technically, the punch card processing technology was patented by inventor Herman Hollerith in 1884, and the company he founded wouldn't become IBM until 40 years later (though it was folded with 3 other companies into the Computing-Tabulating-Recording company in 1911, which would then become IBM in 1924).
To be honest though, I'm not clear how ASCII came from anything used by the punch card sorting machines, since it wasn't proposed until 1961 (by an IBM engineer, but 32 years after Hollerith's death). Do you know where I can read more about the progression here?
Fun fact: ASCII was a variable length encoding. No really! It was designed so that one could use overstrike to implement accents and umlauts, and also underline (which still works like that in terminals). I.e., á would be written a BS ' (or ' BS a), à would be written as a BS ` (or ` BS a), ö would be written o BS ", ø would be written as o BS /, ¢ would be written as c BS |, and so on and on. The typefaces were designed to make this possible.
This lives on in compose key sequences, so instead of a BS ' one types compose-' a and so on.
And this all predates ASCII: it's how people did accents and such on typewriters.
This is also why Spanish used to not use accents on capitals, and still allows capitals to not have accents: that would require smaller capitals, but typewriters back then didn't have them.
The use of 8-bit extensions of ASCII (like the ISO 8859-x family) was ubiquitous for a few decades, and arguably still is to some extent on Windows (the standard Windows code pages). If ASCII had been 8-bit from the start, but with the most common characters all within the first 128 integers, which would seem likely as a design, then UTF-8 would still have worked out pretty well.
The accident of history is less that ASCII happens to be 7 bits, but that the relevant phase of computer development happened to primarily occur in an English-speaking country, and that English text happens to be well representable with 7-bit units.
Most languages are well representable with 128 characters (7-bits) if you do not include English characters among those (eg. replace those 52 characters and some control/punctuation/symbols).
This is easily proven by the success of all the ISO-8859-*, Windows and IBM CP-* encodings, and all the *SCII (ISCII, YUSCII...) extensions — they fit one or more languages in the upper 128 characters.
It's mostly CJK out of large languages that fail to fit within 128 characters as a whole (though there are smaller languages too).
Many of the extended characters in ISO 8859-* can be implemented using pure ASCII with overstriking. ASCII was designed to support overstriking for this purpose. Overstriking was how one typed many of those characters on typewriters.
Before this happened, 7-bit ASCII variants based on ISO 646 were widely used.
Historical luck. Though "luck" is probably pushing it in the way one might say certain math proofs are historically "lucky" based on previous work. It's more an almost natural consequence.
Before ASCII there was BCDIC, which was six bits and non-standardized (there were variants, just like technically there are a number of ASCII variants, with the common just referred to as ASCII these days).
BCDIC was the capital English letters plus common punctuation plus numbers. 2^6 is 64, and for capital letters + numbers, you have 36, plus a few common punctuation marks puts you around 50. IIRC the original by IBM was around 45 or something. Slash, period, comma, tc.
So when there was a decision to support lowercase, they added a bit because that's all that was necessary, and I think the printers around at the time couldn't print anything but something less than 128 characters anyway. There wasn't any ó or ö or anything printable, so why support it?
But eventually that yielded to 8-bit encodings (various ASCIIs like latin-1 extended, etc. that had ñ etc.).
Crucially, UTF-8 is only compatible with the 7-bit ASCII. All those 8-bit ASCIIs are incompatible with UTF-8 because they use the eighth bit.
7 bits isn't that odd. Baudot was 5 bits, and found insufficient, so 6-bit codes were developed; they were found insufficient, so 7-bit ASCII was developed.
IBM had standardized 8-bit bytes on their System/360, so they developed the 8-bit EBCDIC encoding. Other computing vendors didn't have consistent byte lengths... 7-bits was weird, but characters didn't necessarily fit nicely into system words anyway.
I don't really say this to disagree with you, but I feel weird about the phrasing "found insufficient", as if we reevaluated and said 'oops'.
It's not like 5-bit codes forgot about numbers and 80% of punctuation, or like 6-bit codes forgot about having upper and lower case letters. They were clearly 'insufficient' for general text even as the tradeoff was being made, it's just that each bit cost so much we did it anyway.
The obvious baseline by the time we were putting text into computers was to match a typewriter. That was easy to see coming. And the symbols on a typewriter take 7 bits to encode.
The idea was that the free bit would be repurposed, likely for parity.
This is not true. ASCII (technically US-ASCII) was a fixed-width encoding of 7 bits. There was no 8th bit reserved. You can read the original standard yourself here: https://ia600401.us.archive.org/23/items/enf-ascii-1968-1970...
Crucially, "the 7-bit coded character set" is described on page 6 using only seven total bits (1-indexed, so don't get confused when you see b7 in the chart!).
There is an encoding mechanism to use 8 bits, but it's for storage on a type of magnetic tape, and even that still is silent on the 8th bit being repurposed. It's likely, given the lack of discussion about it, that it was for ergonomic or technical purposes related to the medium (8 is a power of 2) rather than for future extensibility.
When ASCII was invented, 36-bit computers were popular, which would fit five ASCII characters with just one unused bit per 36-bit word. Before, 6-bit character codes were used, where a 36-bit word could fit six of them.
I would love to think this is true, and it makes sense, but do you have any actual evidence for this you could share with HN?
I'm not sure, but it does seem like a great bit of historical foresight. It stands as a lesson to anyone standardizing something: wanna use a 32 bit integer? Make it 31 bits. Just in case. Obviously, this isn't always applicable (e.g. sizes, etc..), but the idea of leaving even the smallest amount of space for future extensibility is crucial.
https://www.sensitiveresearch.com/Archive/CharCodeHist/X3.4-...
Looks to me like serendipity - they thought 8 bits would be wasteful, and they didn't have a need for that many characters.
UTF-8 is as good as a design as could be expected, but Unicode has scope creep issues. What should be in Unicode?
Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because
* It's not discrete. Some code points are for combining with other code points.
* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs which (almost?) always display the same, have different code points and meanings.
* It's not all printable. Control characters are in there - they pretty much had to be due to compatibility with ASCII, but they've added plenty of their own.
I'm not aware of any Unicode code points that are animated - at least what's printable, is printable on paper and not just on screen, there are no marquee or blink control characters, thank God. But, who knows when that invariant will fall too.
By the way, I know of one utf encoding the author didn't mention, utf-7. Like utf-8, but assuming that the last bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in utf-7 once, that's how I know what it is. I don't know how he managed to send it, though.
Indeed, one pain point of unicode is CJK unification. https://heistak.github.io/your-code-displays-japanese-wrong/
the fact that there is seemingly no interest in fixing this, and if you want chinese and japanese in the same document, you're just fucked, forever, is crazy to me.
They should add separate code points for each variant and at least make it possible to avoid the problem in new documents. I've heard the arguments against this before, but the longer you wait, the worse the problem gets.
UTF-7 is/was mostly for email, which is not an 8-bit clean transport. It is obsolete and can't encode supplemental planes (except via surrogate pairs, which were meant for UTF-16).
There is also UTF-9, from an April Fools RFC, meant for use on hosts with 36-bit words such as the PDP-10.
I meant to specify, the aim of UTF-7 is better performed by using UTF-8 with `Content-Transfer-Encoding: quoted-printable`
The problem is the solution here. Add obscure stuff to the standard, and not everything will support it well. We got something decent in the end, different languages' scripts will mostly show up well on all sorts of computers. Apple's stuff like every possible combination of skin tone and gender family emoji might not.
Unicode wanted ability to losslessly roundtrip every other encoding, in order to be easy to partially adopt in a world where other encodings were still in use. It merged a bunch of different incomplete encodings that used competing approaches. That's why there are multiple ways of encoding the same characters, and there's no overall consistency to it. It's hard to say whether that was a mistake. This level of interoperability may have been necessary for Unicode to actually win, and not be another episode of https://xkcd.com/927
Why did Unicode want codepointwise round-tripping? One codepoint in a legacy encoding becoming two in Unicode doesn't seem like it should have been a problem. In other words, why include precomposed characters in Unicode?
> * It's not discrete. Some code points are for combining with other code points.
This isn't "scope creep". It's a reflection of reality. People were already constructing compositions like this is real life. The normalization problem was unavoidable.
For more on UTF-8's design, see Russ Cox's one-pager on it:
https://research.swtch.com/utf8
And Rob Pike's description of the history of how it was designed:
https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
Thank you for posting these - the Bell Labs crew is just a different breed.
Of course it's Pike and Thompson and the gang. The amount of contributions these guys made to the world of computing is insane.
I consider "designed on a placemat" to be a selling point for any high-quality standard that will last.
In case no one mentioned that yet in this thread, here's the story how it was invented by Ken Thompson & Rob Pike over a dinner.
https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
https://doc.cat-v.org/bell_labs/utf-8_history
One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these; only the shortest one is valid. E.g. 00000001 is the same as 11000000 10000001.
So why not make the alternatives impossible by adding an offset that starts where the shorter sequences end? So 11000000 10000001 would give codepoint 128+1, as values 0 to 127 are already covered by a 1-byte sequence.
The advantages are clear: No illegal codes, and a slightly shorter string for edge cases. I presume the designers thought about this, so what were the disadvantages? The required addition being an unacceptable hardware cost at the time?
UPDATE: Last bitsequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.
The siblings so far talk about the synchronizing nature of the indicators, but that's not relevant to your question. Your question is more of
Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?
I suspect the answer is
a) the security impacts of overlong encodings were not contemplated; lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings
b) utf-8 as standardized allows for encode and decode with bitmask and bitshift only. Your proposed encoding requires bitmask and bitshift, in addition to addition and subtraction
You can find a bit of email discussion from 1992 here [1] ... at the very bottom there's some notes about what became utf-8:
> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.
The included FSS-UTF that's right before the note does include additive constants.
[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
Oops yeah. One of my bit sequences is of course wrong and seems to have derailed this discussion. Sorry for that. Your interpretation is correct.
I've seen the first part of that mail, but your version is a lot longer. It is indeed quite convincing in declaring b) moot. And security was not as big of a thing then as it is now, so you're probably right.
A variation of a) is comparing strings as UTF-8 byte sequences if overlong encodings are also accepted (before and/or later). This leads to situations where strings tested as unequal are actually equal in terms of code points.
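A quick Python check of how a strict decoder treats the overlong form discussed above (c0 and c1 can never be valid lead bytes in UTF-8):

```python
print("\u0080".encode("utf-8").hex())    # 'c280' -- the one legal encoding of U+0080

try:
    bytes([0b1100_0000, 0b1000_0001]).decode("utf-8")   # overlong encoding of U+0001
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```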
See quectophoton's comment—the requirement that continuation bytes are always tagged with a leading 10 is useful if a parser is jumping in at a random offset—or, more commonly, if the text stream gets fragmented. This was actually a major concern when UTF-8 was devised in the early 90s, as transmission was much less reliable than it is today.
Addendum: This was posted to the front page today: https://doc.cat-v.org/bell_labs/utf-8_history
It also notes that UTF-8 protects against the dangers of NUL and '/' appearing in filenames, which would kill C strings and DOS path handling, respectively.
I assume you mean "11000000 10000001" to preserve the property that all continuation bytes start with "10"? [Edit: looks like you edited that in]. Without that property, UTF-8 loses self-synchronicity, the property that given a truncated UTF-8 stream, you can always find the codepoint boundaries, and will lose at most codepoint worth rather than having the whole stream be garbled.
In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you declared that you had to subtract the legal codepoints represented by shorter sequences, you'd have to introduce additional arithmetic operations in encoding and decoding.
That would make the calculations more complicated and a little slower. Now you can do a few quick bit shifts. This was more of an issue back in the '90s when UTF-8 was designed and computers were slower.
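Here is what "a few quick bit shifts" looks like for a single code point, as a minimal Python sketch that skips all validation (the function name is made up):

```python
def decode_one(buf: bytes) -> int:
    b0 = buf[0]
    if b0 < 0x80:                           # 0xxxxxxx
        return b0
    if b0 & 0b1110_0000 == 0b1100_0000:     # 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (buf[1] & 0x3F)
    if b0 & 0b1111_0000 == 0b1110_0000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12) | ((buf[1] & 0x3F) << 6) | (buf[2] & 0x3F)
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return ((b0 & 0x07) << 18) | ((buf[1] & 0x3F) << 12) | ((buf[2] & 0x3F) << 6) | (buf[3] & 0x3F)

print(hex(decode_one("é".encode("utf-8"))))   # 0xe9
print(hex(decode_one("€".encode("utf-8"))))   # 0x20ac
```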
https://en.m.wikipedia.org/wiki/Self-synchronizing_code
Because then it would be impossible to tell from looking at a byte whether it is the beginning of a character or not, which is a useful property of UTF-8.
I think that would garble random access?
I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF (21 bits).
I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger UTF-8 code units.
That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8 that made the sacrifice: Unicode is limited to 21 bits due to UTF-16. The UTF-8 design trivially extends to support 6 byte long sequences supporting up to 31-bit numbers. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?
> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF (21 bits)
Yes, it is 'truncated' to the "UTF-16 accessible range":
* https://datatracker.ietf.org/doc/html/rfc3629#section-3
* https://en.wikipedia.org/wiki/UTF-8#History
Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:
* https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
It's always dangerous to stick one's neck out and say "[this many bits] ought to be enough for anybody", but I think it's very unlikely we'll ever run out of UTF-8 sequences. UTF-8 can represent about 1.1 million code points, of which we've assigned about 160,000 actual characters, plus another ~140,000 in the Private Use Area, which won't expand. And that's after getting nearly all of the world's known writing systems: the last several Unicode updates have added a few thousand characters here and there for very obscure and/or ancient writing systems, but those won't go on forever (and emoji releases usually add only a handful of new code points, because most new emoji are built from existing code points with combining characters).
If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.
> It sacrifices the ability to encode more than 21 bits
No, UTF-8's design can encode up to 31 bits of codepoints. The limitation to 21 bits comes from UTF-16, which was then adopted for UTF-8 too. When UTF-16 dies we'll be able to extend UTF-8 (well, compatibility will be a problem).
That limitation will be trivial to lift once UTF-16 compatibility can be disregarded. This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or thousand years. Until then, we still have a lot of unassigned code points.
In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.
the limitation tomorrow will be today's implementations, sadly.
> I love when an entity in a position of power is willing to break things in the name of advancement.
It's less fun when you have things that need to keep working break because someone felt like renaming a parameter, or that a part of the standard library looks "untidy"
I agree! And yet I lovingly sacrifice my man-hours to it when I decide to bump that major version number in my dependency manifest.
> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
There were apps that completely rejected non-7-bit data back in the day. Backwards compatibility wasn't the only point. The point of UTF-8 is more (IMO) that UTF-32 is too bulky, UCS-2 was insufficient, UTF-16 was an abortion, and only UTF-8 could have the right trade-offs.
Yeah I honestly don't know what I would change. Maybe replace some of the control characters with more common characters to save a tiny bit of space, if we were to go completely wild and break Unicode backward compatibility too. As a generic multi byte character encoding format, it seems completely optimal even in isolation.
Love the UTF-8 playground that's linked: https://utf8-playground.netlify.app/
Would be great if it was possible to enter codepoints directly; you can do it via the URL (`/F8FF` eg), but not in the UI. (Edit, the future is now. https://github.com/vishnuharidas/utf8-playground/pull/6)
Thanks for the contribution, this is now merged and live.
I’ve re-read so many times Joel’s article on Unicode. It’s also very helpful.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
Read that a few times back then as well, but that and other pieces of the day never told you how to actually write a program that supported Unicode. Just facts about it.
So I went around fixing UnicodeErrors in Python at random, for years, despite knowing all that stuff. It wasn't until I read Batchelder's piece on the "Unicode Sandwich," about a decade later that I finally learned how to write a program to support it properly, rather than playing whack-a-mole.
> ... Batchelder's piece on the "Unicode Sandwich," ...
Is this the piece you mean? https://nedbatchelder.com/text/unipain.html
^Necessary but not sufficient.
UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).
The only problem with UTF-8 is that Windows and Java were developed without knowledge about UTF-8 and ended up with 16-bit characters.
Oh yes, and Python 3 should have known better when it went through the string-bytes split.
UTF-16 made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody" and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.
As Unicode (quickly) evolved, it turned out not that only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.
So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.
UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.
I think more effort should have been made to live with 65,536 characters. My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to unicode is going to be seen a big mistake. We already have enough network bandwith to just send raster graphics for images in most cases. Cluttering the unicode codespace with emojis is pointless.
Yeah, Java and Windows NT3.1 had really bad timing. Both managed to include Unicode despite starting development before the Unicode 1.0 release, but both added unicode back when Unicode was 16 bit and the need for something like UTF-8 was less clear
NeXTstep was also UTF-16 through OpenStep 4.0, IIRC. Apple was later able to fix this because the string abstraction in the standard library was complete enough no one actually needed to care about the internal representation, but the API still retains some of the UTF-16-specific weirdnesses.
It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.
I wonder if that placemat still exists today. It would be such an important piece of computer history.
> It was so easy once we saw it that there was no reason to keep the placemat for notes, and we left it behind. Or maybe we did bring it back to the lab; I'm not sure. But it's gone now.
https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...
UTF-8 is great and I wish everything used it (looking at you JavaScript). But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence even if nominally "invalid". This is how the HTML5 spec works and it's been phenomenally successful.
For security reasons, the correct answer on how process invalid UTF-8 is (and needs to be) "throw away the data like it's radioactive, and return an error." Otherwise you leave yourself wide open to validation bypass attacks at many layers of your stack.
This is rarely the correct thing to do. Users don't particularly like it if you refuse to process a document because it has an error somewhere in there.
Even for identifiers you probably want to do all kinds of normalization even beyond the level of UTF-8 so things like overlong sequences and other errors are really not an inherent security issue.
This is only true because the interpretation is not defined, so different implementations do different things.
> This is how the HTML5 spec works and it's been phenomenally successful.
Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences by replacing them with the U+FFFD ("replacement character"). You'll see it used (for example) in browsers all the time.
Mandating acceptance for every invalid input works well for HTML because it's meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors instead of making an automatic correction that can't be automatically detected after the fact.
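The two behaviors side by side, using Python's standard error handlers (just to illustrate the trade-off; other languages expose the same choice):

```python
bad = b"abc\xff\xfedef"

print(bad.decode("utf-8", errors="replace"))   # invalid bytes become U+FFFD
try:
    bad.decode("utf-8")                        # strict: report the error instead of hiding it
except UnicodeDecodeError as e:
    print("invalid byte at offset", e.start)   # offset 3
```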
> But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined.
This is not a wart. And how to interpret them is not undefined -- you're just not allowed to interpret them as _characters_.
There is right now a discussion[0] about adding a garbage-in/garbage-out mode to jq/jaq/etc that allows them to read and output JSON with invalid UTF-8 strings representing binary data in a way that round-trips. I'm not for making that the default for jq, and you have to be very careful about this to make sure that all the tools you use to handle such "JSON" round-trip the data. But the clever thing is that the proposed changes indeed do not interpret invalid byte sequences as character data, so they stay within the bounds of Unicode as long as your terminal (if these binary strings end up there) and other tools also do the same.
[0] https://github.com/01mf02/jaq/issues/309
I remember a time before UTF-8's ubiquity. It was such a headache moving to i18z. I love UTF-8.
I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).
UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!
On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...
I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.
I worked on a site in the late 90s which had news in several Asian languages, including both simplified and traditional Chinese. We had a partner in Hong Kong sending articles and being a stereotypical monolingual American I took them at their word that they were sending us simplified Chinese and had it loaded into our PHP app which dutifully served it with that encoding. It was clearly Chinese so I figured we had that feed working.
A couple of days later, I got an email from someone explaining that it was gibberish — apparently our content partner who claimed to be sending GB2312 simplified Chinese was in fact sending us Big5 traditional Chinese so while many of the byte values mapped to valid characters it was nonsensical.
I worked on an email client. Many many character set headaches.
If you want to delve deeper into this topic and like the Advent of Code format, you're in luck: i18n-puzzles[1] has a bunch of puzzles related to text encoding that drill how UTF-8 (and other variants such as UTF-16) work into your brain.
[1]: https://i18n-puzzles.com/
Meanwhile Shift-JIS has a bad design, since the second byte of a character can be any ASCII character 0x40-0x9E. This includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde. This can cause a path separator or math operator to appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.
UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
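If you want to see the problem characters yourself, a quick scan with Python's shift_jis codec turns them up (purely illustrative):

```python
# Which characters encode to a two-byte Shift-JIS sequence whose second byte
# is 0x5C, the ASCII backslash? Print a handful of the classic offenders.
found = []
for cp in range(0x3000, 0x10000):
    try:
        b = chr(cp).encode("shift_jis")
    except UnicodeEncodeError:
        continue
    if len(b) == 2 and b[1] == 0x5C:
        found.append((chr(cp), b.hex()))
print(found[:5])
```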
Rob Pike and Ken Thompson are brilliant computer scientists & engineers.
I need to call out a myth about UTF-8. Tools built to assume UTF-8 are not backwards compatible with ASCII. An encoding INCLUDES but also EXCLUDES. When a tool is set to use UTF-8, it will process an ASCII stream, but it will not filter out non-ASCII.
I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
The usual statement isn't that UTF-8 is backwards compatible with ASCII (it's obvious that any 8-bit encoding wouldn't be; that's why we have UTF-7!). It's that UTF-8 is backwards compatible with tools that are 8-bit clean.
Yes, the myth I was pointing out is based on loose terminology. It needs to be made clear that "backwards compatible" means that UTF-8 based tools can receive but are not constrained to emit valid ASCII. I see a lot of comments implying that UTF-8 can interact with an ASCII ecosystem without causing problems. Even worse, it seems most Linux developers believe there is no longer a need to provide a default ASCII setting if they have UTF-8.
Do you have an actual example where this causes an issue? "ASCII" tools mostly just passed along non-ASCII bytes unchanged even before UTF-8.
That's not a myth about UTF-8. That's a decision by tools not to support pure ASCII.
While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.
In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.
UTF-8 also enables this mindblowing design for small string optimization - if the string has 24 bytes or less it is stored inline, otherwise it is stored on the heap (with a pointer, a length, and a capacity - also 24 bytes)
https://news.ycombinator.com/item?id=41339224
How is that UTF8 specific?
Karpathy's "Let's build the GPT Tokenizer" also contains a good introduction to Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32 in the first 20 minutes: https://www.youtube.com/watch?v=zduSFxRajkE
It's worth noting that Stallman had earlier proposed a design for Emacs "to handle all the world's alphabets and word signs" with similar requirements to UTF-8. That was the etc/CHARACTERS file in Emacs 18.59 (1990). The eventual international support implemented in Emacs 20's MULE was based on ISO-2022, which was a reasonable choice at the time, based on earlier Japanese work. (There was actually enough space in the MULE encoding to add UTF-8, but the implementation was always going to be inefficient with the number of bytes at the top of the code space.)
Edit: see https://raw.githubusercontent.com/tsutsui/emacs-18.59-netbsd...
Great example of a technology you get from a brilliant guy with a vision and that you'll never get out of a committee.
A little off topic but amidst a lot of discussion of UTF-8 and its ASCII compatibility property I'm going to mention my one gripe with ASCII, something I never see anyone talking about, something I've never talked about before: The damn 0x7f character. Such an annoying anomaly in every conceivable way. It would be much better if it was some other proper printable punctuation or punctuation adjacent character. A copyright character. Or a pi character or just about anything other than what it already is. I have been programming and studying packet dumps long enough that I can basically convert hex to ASCII and vice versa in my head but I still recoil at this anomalous character (DELETE? is that what I should call it?) every time.
Much better in every way except the one that mattered most: being able to correct punching errors in a paper tape without starting over.
I don't know if you have ever had to use White-Out to correct typing errors on a typewriter that lacked the ability natively, but before White-Out, the only option was to start typing the letter again, from the beginning.
0x7f was White-Out for punched paper tape: it allowed you to strike out an incorrectly punched character so that the message, when it was sent, would print correctly. ASCII inherited it from the Baudot–Murray code.
It's been obsolete since people started punching their tapes on computers instead of Teletypes and Flexowriters, so around 01975, and maybe before; I don't know if there was a paper-tape equivalent of a duplicating keypunch, but that would seem to eliminate the need for the delete character. Certainly TECO and cheap microcomputers did.
Nice, thanks.
Related: Why is there a “small house” in IBM's Code page 437? (glyphdrawing.club) [1]. There are other interesting articles mentioned in the discussion. m_walden probably would comment here himself
[1] https://news.ycombinator.com/item?id=43667010
Thanks, interesting.
I once saw a good byte encoding for Unicode: 7 bit for data, 1 for continuation/stop. This gives 21 bit for data, which is enough for the whole range. ASCII compatible, at most 3 bytes per character. Very simple: the description is sufficient to implement it.
Probably a good idea, but when UTF-8 was designed the Unicode committee had not yet made the mistake of limiting the character range to 21 bits. (Going into why it's a mistake would make this comment longer than it's worth, so I'll only expound on it if anyone asks me to). And at this point it would be a bad idea to switch away from the format that is now, finally, used in over 99% of all documents online. The gain would be small (not zero, but small) and the cost would be immense.
Didn't they limit the range to 21 bits because UTF-16 has that limitation?
This fits your description: https://en.wikipedia.org/wiki/Variable-length_quantity
It took time for UTF-8 to make sense. The sheer size of everything was a real struggle just after the turn of the century. Today it makes more sense because capacity and compute power are much greater, but back then it was a huge pain in the ass.
It made much more sense than UTF-16 or any of the existing multi-byte character sets, and the need for more than 256 characters had been apparent for decades. Seeing its simplicity, it made perfect sense almost immediately.
No, it didn't. Not at the time. Like I said, processing and storage were a pain back around 2000. Windows supported UCS-2 (predecessor to UTF-16), which was fixed-width 16-bit and faster to encode and decode, and since most of the world was Windows at the time, it made more sense to use UCS-2. Also, the world was only beginning to be more connected, so UTF-8 seemed like overkill.
NOW in hindsight it makes more sense to use UTF-8 but it wasn't clear back 20 years ago it was worth it.
2 replies →
Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode UTF-8, whereas the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.
Yeah, protobuf's varint are quite hard to decode with current SIMD instructions, but it would be quite easy, if we get element wise pext/pdep instructions in the future. (SVE2 already has those, but who has SVE2?)
I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?
UTF-8 can represent up to 1,114,112 code points in Unicode. And in Unicode 15.1 (2023, https://www.unicode.org/versions/Unicode15.1.0/) a total of 149,813 characters are included, which covers most of the world's languages, scripts, and emojis. That leaves roughly 960K code points for future expansion.
So, it won't fill up during our lifetime I guess.
I wouldn't be too quick to jump to that conclusion, we could easily shove another 960k emojis into the spec!
Wait until we get to know another species; then we will not just fill that Unicode space, but we will ditch any UTF-16 compatibility so fast it will make your head spin on a swivel.
Imagine the code points we'll need to represent an alien culture :).
Nothing is automatic.
If we ever needed that many characters, yes the most obvious solution would be a fifth byte. The standard would need to be explicitly extended though.
But that would probably require having encountered literate extraterrestrial species to collect enough new alphabets to fill up all the available code points first. So seems like it would be a pretty cool problem to have.
UTF-8 is just an encoding of Unicode. It is specified so that it can encode all Unicode code points up to 0x10FFFF; it doesn't extend further. UTF-16 encodes the same range in a similar way; it doesn't encode anything more.
So what would need to happen first is that Unicode decides to include larger code points. Then UTF-8 would need to be extended to handle encoding them. (But I don't think that will happen.)
It seems like Unicode code points are less than 30% allocated, roughly. So there's over 70% free space.
---
Think of these as three separate concepts to make it clear. We are effectively dealing with two translations: first from the abstract symbol to a defined Unicode code point, then from that code point to bytes via the UTF-8 encoding (sketched in code after the list).
1. The glyph or symbol ("A")
2. The Unicode code point for the symbol (U+0041 Latin Capital Letter A)
3. The UTF-8 encoding of the code point, as bytes (0x41)
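A quick way to see those three layers in code (a small Python sketch):

```python
for ch in ("A", "€"):
    print(ch, hex(ord(ch)), ch.encode("utf-8"))
# A 0x41   b'A'               -> one byte, same as ASCII
# € 0x20ac b'\xe2\x82\xac'    -> three bytes for U+20AC
```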
As an aside: UTF-8, as originally specified in RFC 2279, could encode codepoints up to U+7FFFFFFF (using sequences of up to six bytes). It was later restricted to U+10FFFF to ensure compatibility with UTF-16.
I take it you could choose to encode a code point using a larger number of bytes than are actually needed? E.g., you could encode "A" using 1, 2, 3 or 4 bytes?
Because if so: I don't really like that. It would mean that "equal sequence of code points" does not imply "equal sequence of encoded bytes" (the converse continues to hold, of course), while offering no advantage that I can see.
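For what it's worth, UTF-8 explicitly forbids these "overlong" encodings for exactly that reason: each code point has a single shortest form, and conforming decoders must reject the longer ones. A quick check (Python's decoder behaves this way):

```python
print(b"\x41".decode("utf-8"))      # 'A' -- the only valid encoding of U+0041

try:
    b"\xc1\x81".decode("utf-8")     # would be an overlong 2-byte form of U+0041
except UnicodeDecodeError as e:
    print(e)                        # rejected: 0xC0/0xC1 lead bytes are never valid
```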
Well, yes, Ken Thompson, the father of Unix, is behind it.
UTF-8 is undeniably a good answer, but to a relatively simple bit-twiddling / variable-length integer encoding problem in a somewhat specific context.
I realize that hindsight is 20/20, and times were different, but let's face it: "how to use an unused top bit to best encode larger numbers representing Unicode" is not that much of a challenge, and the space of practical solutions isn't even all that large.
Except that there were many different solutions before UTF-8, all of which sucked really badly.
UTF-8 is the best kind of brilliant. After you've seen it, you (and I) think of it as obvious, and clearly the solution any reasonable engineer would come up with. Except that it took a long time for it to be created.
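For reference, the whole bit layout fits in a few lines; a minimal, non-validating encoder sketch (a real one must also reject surrogates and out-of-range values):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point; the lead byte's prefix tells you the sequence length."""
    if cp < 0x80:                                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                   # 11110xxx + three continuation bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(0x20AC) == "€".encode("utf-8")      # b'\xe2\x82\xac'
assert utf8_encode(0x1F600) == "😀".encode("utf-8")    # b'\xf0\x9f\x98\x80'
```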
I just realised that all Latin text is wasting 12% of storage/memory/bandwidth on an MSB that is always zero. At least it compresses well. Is there any technology that uses the 8th bit for something useful, e.g. error checking?
See mort96's comments about 7-bit ASCII and parity bits (https://news.ycombinator.com/item?id=45225911). Kind of archaic now, though - 8-bit bytes with the error checking living elsewhere in the stack seems to be preferred.
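For illustration, the old 7-data-bits-plus-parity arrangement looked roughly like this (a sketch of even parity, not any particular serial protocol):

```python
def with_parity(b7: int) -> int:
    """Put an even-parity bit in the MSB above a 7-bit ASCII value."""
    assert 0 <= b7 < 0x80
    parity = bin(b7).count("1") & 1        # 1 if the 7 data bits have odd weight
    return (parity << 7) | b7

print(hex(with_parity(ord("A"))))          # 0x41 -- 'A' already has an even bit count
print(hex(with_parity(ord("C"))))          # 0xc3 -- parity bit set in the MSB
```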
One aspect of Unicode that is probably not obvious is that it lets you keep using old encodings just fine. You can always get their Unicode equivalents; that is what Unicode was about. Otherwise just keep the data as is, tagged with its encoding. This nicely extends to filesystem "encodings" too.
For example, modern Python internally uses three fixed-width forms (Latin-1, UCS-2, and UCS-4) depending on the contents of the string. But this can be done for all encodings and also for things like file names that do not follow Unicode. The Unicode standard does not dictate that everything must take the same form; it can be used to keep existing forms while making them compatible.
UTF-8 is a nice extension of ASCII from the compatibility point of view, but it might not be the most compact, especially if the text is not English-like. Also, the variable character length makes it inconvenient to work with strings unless they are parsed into (or saved from) a 2- or 4-byte char array.
> Every ASCII encoded file is a valid UTF-8 file.
More importantly, that file has the same meaning. Same with the converse.
Nice article, thank you. I love UTF-8, but I only advocate it when used with a BOM. Otherwise, an application may have no way of knowing that it is UTF-8, and that it needs to be saved as UTF-8.
Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters).
The same can happen to a plain ASCII file (without a BOM): once you edit it, and you add, say, some accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.
So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.
The fact that you advocate using a BOM with UTF-8 tells me that you run Windows. Any long-term Unix user has probably seen this error message before (copy and pasted from an issue report I filed just 3 days ago):
If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of that shell script starting with the `#!` character pair that tells the Linux kernel "The application after the `#!` is the application that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash did end up running the script, with only one error message (because the line did not start with `#` and therefore it was not interpreted as a Bash comment) that didn't matter to the actual script logic.
This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file that contains only codepoints below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that look for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).
What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that make another encoding sensible, reparsing it as something else (like Windows-1252) might be sensible.
But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.
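If you do have to consume files that may or may not carry a stray BOM, one pragmatic option (in Python, for example) is the `utf-8-sig` codec, which strips a leading BOM when present and passes BOM-less input through unchanged:

```python
import codecs

with_bom    = codecs.BOM_UTF8 + b"#!/bin/sh\n"
without_bom = b"#!/bin/sh\n"

print(repr(with_bom.decode("utf-8-sig")))     # '#!/bin/sh\n' -- BOM stripped
print(repr(without_bom.decode("utf-8-sig")))  # '#!/bin/sh\n' -- unchanged
print(repr(with_bom.decode("utf-8")[:3]))     # '\ufeff#!'    -- plain utf-8 keeps it
```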
I like your answer, and the others too, but I suspect I have an even worse problem than running Windows: I am an Amiga user :D
The Amiga always used all 8 bits (ISO-8859-1 by default), so detecting UTF-8 without a BOM is not so easy, especially when you start with an empty file, or in some scenario like the other one I mentioned.
And it's not that Macs and PCs don't have 8-bit legacy or coexistence needs. What you seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important.
Since we now have UTF-8 files with BOMs that need to be handled anyway, would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?
Also some XML parsers I used choked on UTF-8 BOMs. Not sure if valid XML is allowed to have anything other than clean ASCII in the first few characters before declaring what the encoding is?
I respectfully disagree. The BOM is a Windows-specific idiosyncrasy resulting from its early adoption of UTF-16. In the Unix world, a BOM is unexpected and causes problems with many programs, such as GCC, PHP and XML parsers. Don't use it!
The correct approach is to use and assume UTF-8 everywhere. 99% of websites use UTF-8. There is no reason to break software by adding a BOM.
A BOM is awful, as it breaks concatenation. In the modern world everything should just be assumed to be UTF-8 by default.
You do not need a BOM for UTF-8. Ever. Byte order issues are not a problem for UTF-8 because UTF-8 is manipulated as a string of _bytes_, not as a string of 16-bit or 32-bit code units.
You _do_ need a BOM for UTF-16 and UTF-32.
In a pure UTF-8 world we would not need it, sure. I get that point. But what do you want to do with 40+ years worth of text files that came after 7-bit ASCII, where they may coexist with UTF-8? If we want to preserve our past, the practical solution is that the OS or app has a default character set for 8-bit text encoding, in addition to supporting (and using as a default) UTF-8.
I also agree that "BOM" is the wrong name for a UTF-8... BOM. Byte order is not the issue. But still, it's a header that says that the file, even if empty, is UTF-8. Detecting an 8-bit legacy character set is much more difficult than recognizing (and skipping) a BOM.
I had fun building a little UTF-8 playground about a year ago to drive some of the concepts home and to go back to as a cheatsheet.
https://github.com/incanus/learn-utf8
I made an interactive one since I couldn't find anything that allows individually set/unset bits and see what happens. Here: https://utf8-playground.netlify.app/
Very nice! That's fun to play with.
So why wasn't 10 chosen as the 2-byte prefix to gain 1 bit? Easier to detect encoding errors?
UTF-8 is a neat way of encoding 1M+ code points in 8-bit bytes while including 7-bit ASCII. If only Unicode were as neat - sigh. I guess it's way too late to flip Unicode versions and start again, avoiding the weirdness.
The story is that Ken and Rob were at a diner when Ken gave structure to it and wrote the initial encode/decode functions on napkins. UTF-8 is so simple yet it required a complex mind to do it.
Love reading explorations of structures and technical phenomena that are basically the digital equivalent of oxygen in their ubiquity and in how we take them for granted
Excellent Computerphile video with Tom Scott about this as well:
https://youtu.be/MijmeoH9LT4
UTF-8 contributors are some of our modern day unsung heroes. The design is brilliant but the dedication to encode every single way humans communicate via text into a single standard, and succeed at it, is truly on another level.
Most other standards just do the xkcd thing: "now there's 15 competing standards"
No, it's not. It's just a form of Elias-Gamma coding.
* unary coding.
UTF-8 Everywhere Manifesto: https://utf8everywhere.org/
I read online that codepoints are formatted with 4 hex chars for historical reasons. U+41 (Latin A) is formatted as U+0041.
UTF-8 was a huge improvement for sure, but 20-25 years ago I was working with LATIN-1 (so 8-bit characters), which was a struggle in the years it took for everything to switch to UTF-8. The compatibility with ASCII meant you only really noticed something was wrong when the data had special characters not representable in ASCII but valid in LATIN-1. So perhaps breaking backwards compatibility would have resulted in less data corruption overall.
Until you interact with it as a programmer
Hmm, I count at most 21 bits. Just 2 million code points.
Is that all Unicode can do? How are they going to fit all the emojis in?
The max code point in Unicode is 0x10FFFF. ceil(log2(0x10FFFF+1)) = 21. So yes, a Unicode codepoint requires only 21 bits.
297,334 code points have been assigned so far; that's about 1/4 of the available range, if my napkin math is right. Plenty of room for more emoji.
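The napkin math checks out (the assigned-count figure here is just the one quoted above; it changes with each Unicode version):

```python
import math

max_cp   = 0x10FFFF
assigned = 297_334                                # figure quoted above

print(math.ceil(math.log2(max_cp + 1)))           # 21 -> bits needed per code point
print(max_cp + 1)                                 # 1114112 possible code points
print(round(assigned / (max_cp + 1) * 100))       # ~27% of the range allocated
```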
Seems obvious, ASCII had an unused bit, so you use it. Why did they even bother with UTF-16 and -32 then?
Because the original design assumed that 16 bits are enough to encode everything worth encoding, hence UCS2 (not UTF-16, yet) being the easiest and most straightforward way to represent things.
Ah ok. Well even then, you end up spending 16 bits for every ASCII character.
Uvarint also has the property that a file containing only ASCII characters is still a valid ASCII file.
Anyone remembers what UTF7.5 or UTF7,5 was? I can't find references to its description(s)...
Finally found a description here: http://www.czyborra.com/utf/
I'll mention IPv6 as a bad design that could potentially have been a UTF-8-like success story.
No. UTF-8 is for encoding text, so we don't need to care about it being variable length because text was already variable length.
The network addresses aren't variable length, so if you decide "Oh IPv6 is variable length" then you're just making it worse with no meaningful benefit.
The IPv4 address is 32 bits, the IPv6 address is 128 bits. You could go 64 but it's much less clear how to efficiently partition this and not regret whatever choices you do make in the foreseeable future. The extra space meant IPv6 didn't ever have those regrets.
It suits a certain kind of person to always pay $10M to avoid the one-time $50M upgrade cost. They can do this over a dozen jobs in twenty years, spending $200M to avoid $50M cost and be proud of saving money.
You reserve 32 bits of those 128 for backward compatibility, just like UTF-8 did for ASCII, and request a backward-compatible fallback from user interfaces. I hope that clears it up.
Good read, thank you!
> Show the character represented by the remaiing 7 bits on the screen.
I notice there is a typo.
Fixed that, thank you!
So brilliant that we’re all still using ASCII!†
† With an occasional UNICODE flourish.
some insightful unicode regex examples...
https://dev.to/bbkr/utf-8-internal-design-5c8b
Regex? Did you link to the wrong page? I see no regexes on that page.
well you have to click around a bit and be prepared to look at the other pages in Pabels series of posts … I linked to this one since I felt it chimes well with the OP
It really is, in so many ways.
It is amazing how successful it's been.
UTF-8 should be a universal tokenizer
I'm just gonna leave this here too: https://www.youtube.com/watch?v=MijmeoH9LT4
What I find inconvenient about emoji characters is the differential length counting in programming languages
That's a problem with programming languages having inconsistent definitions of length. They could be like Swift where the programmer has control over what counts as length one. Or they could decide that the problem shouldn't be solved by the language but by libraries like ICU.
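To make the mismatch concrete (a small Python sketch; the thumbs-up-with-skin-tone emoji is a single visible character built from two code points):

```python
s = "👍🏽"                                   # U+1F44D + U+1F3FD (skin tone modifier)

print(len(s))                                # 2 -- Python counts code points
print(len(s.encode("utf-8")))                # 8 -- bytes in UTF-8 (4 per code point)
print(len(s.encode("utf-16-le")) // 2)       # 4 -- UTF-16 code units (two surrogate pairs)
```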
Kill Unicode. I'm done with this after these 25-byte single characters.
Looks similar to MIDI.
> Another one is the ISO/IEC 8859 encodings are single-byte encodings that extend ASCII to include additional characters, but they are limited to 256 characters.
ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed script text streams.
How many llm tokens are wasted everyday resolving utf issues?
I specialize in protocol design, unfortunately. A while ago I had to code some Unicode conversion routines from scratch, and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, likely for objective reasons. Dealing with multiple Unicode encodings is a minefield. I even made an angry write-up back then: https://web.archive.org/web/20231001011301/http://replicated...
UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.
> For example, how do you handle UTF-8 encoded surrogate pairs?
Surrogate pairs aren’t applicable to UTF-8. That part of Unicode block is just invalid for UTF-8 and should be treated as such (parsing error or as invalid characters etc).
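That is how strict codecs behave in practice; in Python, for instance, surrogates are refused in both directions:

```python
try:
    "\ud800".encode("utf-8")            # a lone surrogate cannot be encoded
except UnicodeEncodeError as e:
    print(e)

try:
    b"\xed\xa0\x80".decode("utf-8")     # the byte pattern that would mean U+D800
except UnicodeDecodeError as e:
    print(e)
```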
In theory, yes. In practice, there are throngs of parsers and converters who might handle such cases differently. https://seriot.ch/projects/parsing_json.html
> Unicode per se is a dumpster fire
Maybe as to emojis, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.
Another collaboration by Pike and Thompson can be seen here: https://go.dev/.
What are the perceived benefits of UTF-16 and 32 and why did they come about?
I could ask Gemini but HN seems more knowledgeable.
UTF-16 is a hack that was invented when it became clear that UCS-2 wasn't gonna work (65536 codepoints was not enough for everybody).
Almost the entire world could have ignored it if not for Microsoft making the wrong choice with Windows NT and then stubbornly insisting that their wrong choice was indeed correct for a couple of decades.
There was a long phase where some parts of Windows understood (and maybe generated) UTF-16 and others only UCS-2.
Besides Microsoft, plenty of others thought UTF-16 to be a good idea. The Haskell Text type used to be based on UTF-16; it only switched to UTF-8 a few years ago. Java still uses UTF-16, but with an ad hoc optimization called CompactStrings to use ISO-8859-1 where possible.
Thank you! That's interesting.
What about UTF-7? That seemed like a bad idea even at the time.
meh. it's a brilliant design to put a bandage over a bad design. if a language can't fit into 255 glyphs, it should be reinvented.
Sun Tzu would like a word or two with you.
Now fix fonts! It should be possible to render any valid string in a font.
UTF-8 is a horrible design. The only reason it was widely adopted was backwards compatibility with ASCII. There are a large number of invalid byte combinations that have to be discarded. Parsing forward is complex even before taking invalid byte combinations into account, and parsing backwards is even worse. Compare that to UTF-16, where parsing forwards and backwards is simpler than UTF-8, and if there is an invalid surrogate combination, one can assume it is a valid UCS-2 char.
UTF-16 is an abomination. It's only easy to parse because it's artificially limited to 1 or 2 code units. It's an ugly hack that requires reserving 2048 code points ("surrogates") from the Unicode table just for the encoding itself.
It's also the reason why Unicode has a limit of about 1.1 million code points: without UTF-16, we could have over 2 billion (which is the UTF-8 limit).
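The arithmetic behind that cap, spelled out:

```python
high_surrogates = 0xDBFF - 0xD800 + 1    # 1024 lead ("high") surrogates
low_surrogates  = 0xDFFF - 0xDC00 + 1    # 1024 trail ("low") surrogates

pairs = high_surrogates * low_surrogates # 1,048,576 code points reachable via pairs
print(0x10000 + pairs)                   # 1114112 == 0x110000, hence the U+10FFFF cap
print(high_surrogates + low_surrogates)  # 2048 code points reserved for the encoding itself
```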