
Comment by rowls66

3 days ago

I think more effort should have been made to live with 65,536 characters. My understanding is that code points beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to Unicode is going to be seen as a big mistake. We already have enough network bandwidth to just send raster graphics for images in most cases. Cluttering the Unicode codespace with emojis is pointless.

You are mistaken. Chinese Hanzi and the writing systems that derive from or incorporate them require far more than 65,536 code points. In particular, a lot of these characters appear in formal family or place names. UCS-2 failed because it couldn't represent them, and people using these languages justifiably objected to having to change how their family names are written to suit computers, rather than computers handling them properly.

This "two bytes should be enough" mistake was one of the biggest blind spots in Unicode's original design, and is cited as an example of how standards groups can have cultural blind spots.

  • UTF-16 also had a bunch of unfortunate ramifications for the overall design of Unicode, e.g. requiring a substantial chunk of the BMP to be reserved for surrogates and forcing Unicode code points to be capped at U+10FFFF.
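For concreteness, the surrogate mechanism can be sketched in a few lines of Python (the helper name is mine, not anything standard): a code point above U+FFFF has 0x10000 subtracted, and the remaining 20 bits are split across a high surrogate (U+D800..U+DBFF) and a low surrogate (U+DC00..U+DFFF). Two 10-bit halves on top of the BMP is exactly why the ceiling is U+10FFFF.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000              # at most 20 bits remain
    high = 0xD800 | (v >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low

# U+1F600 (GRINNING FACE) becomes the pair D83D DE00:
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
# Cross-check against Python's own UTF-16 encoder:
print("\U0001F600".encode("utf-16-be").hex(" "))     # d8 3d de 00
```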

CJK unification (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) i.e. combining "almost same" Chinese/Japanese/Korean characters into the same codepoint, was done for this reason, and we are now living with the consequence that we need to load separate Traditional/Simplified Chinese, Japanese, and Korean fonts to render each language. Total PITA for apps that are multi-lingual.

Your understanding is incorrect; a substantial number of the ranges allocated outside BMP (i.e. above U+FFFF) are used for CJK ideographs which are uncommon, but still in use, particularly in names and/or historical texts.

The silly thing is, lots of emoji these days aren't even a single code point. Many are two or more code points combined with a zero-width joiner. Surely we could've introduced one code point that says "the next code point represents an emoji from a separate emoji set"?
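To illustrate the ZWJ point, here is a quick Python sketch (my own example sequence, the woman-astronaut glyph) that takes one such emoji apart into its component code points:

```python
import unicodedata

# WOMAN + ZERO WIDTH JOINER + ROCKET renders as a single "woman astronaut" glyph
astronaut = "\U0001F469\u200D\U0001F680"
for ch in astronaut:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F680 ROCKET
```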

  • With that approach you could no longer look at a single code point and decide if it's e.g. a space. You would always have to look back at the previous code point to see if you are now in the emoji set. That would bring its own set of issues for tools like grep.

    But what if instead of emojis we took the CJK set and made it more compositional? Instead of >100k characters with distinct glyphs, we could have defined a number of brush-stroke characters and compositional characters (like "three of the previous character in a triangle formation"). We could still have distinct code points for the most common couple thousand characters, just as ä can be encoded as one code point or two (umlaut dots plus a).

    Alas, in the 90s this would have been seen as too much complexity.
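The ä example is easy to check with Python's `unicodedata` module: the precomposed form and the combining-mark form are distinct code point sequences that normalize to each other.

```python
import unicodedata

composed = "\u00E4"     # ä as a single code point (LATIN SMALL LETTER A WITH DIAERESIS)
decomposed = "a\u0308"  # 'a' followed by COMBINING DIAERESIS

print(composed == decomposed)                                # False: raw sequences differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: compose to one code point
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: decompose to two
```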

    • Seeing your handle I am at risk of explaining something you may already know, but, this exists! And it was standardized in 1993, though I don't know when Unicode picked it up.

      Ideographic Description Characters: https://www.unicode.org/charts/PDF/U2FF0.pdf

      The fine people over at Wenlin actually have a renderer that generates characters based on this sort of programmatic definition, their Character Description Language: https://guide.wenlininstitute.org/wenlin4.3/Character_Descri... ... in many cases, they are the first digital renderer for new characters that don't yet have font support.

      Another interesting bit: the Cantonese linguist community I regularly interface with generally doesn't mind unification. It's treated the same as a "single-storey a" (the one you write by hand) versus a "two-storey a" (the one in this font). Sinitic languages fractured into families in part because the graphemes don't explicitly encode the phonetics (plus physical distance), and the graphemes themselves fractured because somebody's uncle had terrible handwriting.

      I'm in Hong Kong, so we use 説 (8AAC, normalized to 8AAA) while Taiwan would use 說 (8AAA). This is a case my linguist friends consider a mistake, but it happened early enough that it was only retroactively normalized. Same word, same meaning, grapheme distinct by regional divergence. (I think we actually have three codepoints that normalize to 8AAA because of radical variations.)

      The argument basically reduces to "should we encode distinct graphemes, or distinct meanings?" Unicode has never been fully consistent on either side of that. The latest example: we're getting ready to do Seal Script as a separate, non-unified set of code points. https://www.unicode.org/roadmaps/tip/

      In Hong Kong, some old government files just don't work unless you have the font that has the specific author's Private Use Area mapping (or happen to know the source encoding and can re-encode it). I've regularly had to pull up old Windows in a VM to grab data about old code pages.

      In short: it's a beautiful mess.

I entirely agree that we could've made better use of the leading 16-bit space. But protocol-wise, adding a second component (images) to the concept of textual strings would've been a terrible choice.

The grand crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification, where we already had a whopping 1.1 million code points at our disposal.

  • > The grand crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification

    I'm not sure what you mean by this. The UTF-8 specification was written long before emoji were included in Unicode, and generally has no bearing on what characters it's used to encode.
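Indeed, UTF-8 is agnostic about what it encodes: its variable-length scheme covers the full code space up to U+10FFFF regardless of what kind of character sits at a given code point. A quick Python check of byte lengths across the range (sample code points are my own, including U+8AAA from upthread):

```python
# UTF-8 width grows with the code point: ASCII is 1 byte, the rest of the
# BMP takes 2-3 bytes, and everything above U+FFFF (emoji included) takes 4.
for cp in (0x41, 0xE4, 0x8AAA, 0x1F600):
    b = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(b)} byte(s): {b.hex(' ')}")
# U+0041 -> 1 byte(s): 41
# U+00E4 -> 2 byte(s): c3 a4
# U+8AAA -> 3 byte(s): e8 aa aa
# U+1F600 -> 4 byte(s): f0 9f 98 80
```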