Comment by mort96

15 hours ago

The silly thing is, lots of emoji these days aren't even a single code point. So many emoji these days are two other code points combined with a zero width joiner. Surely we could've introduced one code point which says "the next code point represents an emoji from a separate emoji set"?
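For a concrete picture of the "two code points plus a zero width joiner" case, here is a small sketch in Python. The specific emoji (woman technologist) is just one example I picked; it is written with escapes so the code points are visible.

```python
# Woman technologist is a ZWJ sequence:
# WOMAN (U+1F469) + ZERO WIDTH JOINER (U+200D) + PERSONAL COMPUTER (U+1F4BB)
emoji = "\U0001F469\u200D\U0001F4BB"

# Python's len() counts code points, not rendered glyphs.
code_points = [f"U+{ord(ch):04X}" for ch in emoji]

print(len(emoji))      # number of code points in the sequence
print(code_points)     # the individual code points, in order
```

A font that supports the sequence renders it as one glyph, even though the string holds three code points.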

With that approach you could no longer look at a single code point and decide if it's e.g. a space. You would always have to look back at the previous code point to see if you are now in the emoji set. That would bring its own set of issues for tools like grep.
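A rough sketch of that look-back problem, in Python. Everything here is hypothetical: `EMOJI_PREFIX` is a made-up marker (a private-use code point stands in for it), and `is_space_at` is a helper invented for illustration. It also ignores edge cases such as the marker appearing as its own payload.

```python
# Hypothetical scheme: a marker code point says "the next code point
# is an index into a separate emoji set". U+E000 is only a stand-in.
EMOJI_PREFIX = "\uE000"

def is_space_at(text: str, i: int) -> bool:
    """Return True if text[i] is a real space under the hypothetical scheme."""
    if text[i] != " ":
        return False
    # The code point alone is not enough: a preceding marker would
    # reinterpret it as an emoji index instead of a space.
    return i == 0 or text[i - 1] != EMOJI_PREFIX

plain = "a b"
marked = "a" + EMOJI_PREFIX + " b"  # here U+0020 would select an emoji

print(is_space_at(plain, 1))   # space with no marker before it
print(is_space_at(marked, 2))  # same code point, but preceded by the marker
```

A tool that scans for U+0020 without tracking the previous code point would treat both cases the same, which is the kind of issue grep-like tools would hit.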

But what if, instead of emoji, we take the CJK set and make it more compositional? Instead of >100k characters with different glyphs, we could have defined a number of brush stroke characters and compositional characters (like "three of the previous character in a triangle formation"). We could still make distinct code points for the most common couple thousand characters, just like ä can be encoded as one code point or two (a followed by combining umlaut dots).
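The ä case can be checked with Python's standard `unicodedata` module. This is only a sketch of the existing precomposed/decomposed duality, not of the hypothetical CJK scheme.

```python
import unicodedata

composed = "\u00E4"     # LATIN SMALL LETTER A WITH DIAERESIS, one code point
decomposed = "a\u0308"  # "a" followed by COMBINING DIAERESIS, two code points

# The raw strings differ, even though they render the same.
same_raw = composed == decomposed

# Normalization maps between the two forms.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed

print(len(composed), len(decomposed))
print(same_raw, same_nfc, same_nfd)
```

A compositional CJK encoding would presumably need the same kind of normalization step so that the single-code-point and composed spellings of a character compare equal.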

Alas, in the 90s this would have been seen as too much complexity.