Comment by panpog

1 year ago

Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatoric explosion of infrequently used characters.

But they don't have that explosion if you only encode the combinatoric primitives those characters are made of and then use composition rules?

  • You still get the combinatoric explosion, but you have more bits to work with. Imagine if you could combine any 9 jamo into a single Hangul syllable block. (The real combinatorics is more complicated, and I don't know if it's this bad.) Encoding just the 24 jamo and a control character requires 25 codepoints. Giving each syllable block its own codepoint would require 24^9 codepoints, which is about 2.6 trillion and far more than the 2^32 (about 4.3 billion) that fit in 32 bits.

    • > Giving each syllable block its own codepoint

      That's the thing - you wouldn't do that! Only a small subset of frequently used combos would get its own ID; the rest would only be composable.
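
For anyone who wants to poke at this, here is a rough Python sketch of both points in the thread. The first part just checks the arithmetic using the hypothetical figures from the reply above (24 jamo, up to 9 per block), which are not the real Hangul rules. The second part uses the standard library's `unicodedata.normalize` to show what Unicode actually does: a sequence of conjoining jamo composes to a single precomposed codepoint when one exists, and otherwise stays as a composable sequence. The variable names are made up for illustration.

```python
import unicodedata

# Hypothetical figures from the thread, not the real Hangul combinatorics.
JAMO_COUNT = 24
MAX_JAMO_PER_BLOCK = 9

precomposed_needed = JAMO_COUNT ** MAX_JAMO_PER_BLOCK
fits_in_32_bits = precomposed_needed <= 2 ** 32
print(precomposed_needed, fits_in_32_bits)

# A common modern syllable: three conjoining jamo (U+1112, U+1161, U+11AB).
# NFC normalization composes them into one precomposed codepoint.
modern_sequence = "\u1112\u1161\u11ab"
composed = unicodedata.normalize("NFC", modern_sequence)
print(len(modern_sequence), "->", len(composed))

# An archaic combination (U+1100 followed by the old vowel U+119E) has no
# precomposed codepoint, so NFC leaves it as a sequence of jamo.
archaic_sequence = "\u1100\u119e"
uncomposed = unicodedata.normalize("NFC", archaic_sequence)
print(len(archaic_sequence), "->", len(uncomposed))
```

So the frequently used blocks get their own codepoint and the rare ones are still representable as sequences, which is roughly the scheme described in the last reply.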