Comment by rmunn

19 hours ago

> ... utf-8 that covers all the languages but excludes the emojis ...

Ah, but the U+0000 to U+FFFF plane does not cover all the languages. You might think that only historical and archaic languages are found in Unicode's astral planes (e.g., U+20000 to U+2A6DF is used for historical Chinese characters no longer used today), but in fact there are modern languages found in the U+10000 plane.

You might not care about Osage (the language of the Osage Nation of northern Oklahoma) since its last native speaker passed away in 2005, but there is a revival program trying to teach Osage to people. Osage's script was developed quite recently as part of the revival program, so it didn't land in the U+0000 to U+FFFF plane; it was assigned the U+104B0 to U+104FF block instead.

The Toto language of Bengal, on the other hand, is still active: over 1000 speakers, all living in the village of Totopara. It also never had an alphabet until recently, so its Unicode block is U+1E290 to U+1E2BF.

Then there's Wancho, spoken by about 60,000 people in India. Its alphabet was created between 2001 and 2012, and added to Unicode in 2019. It was assigned the U+1E2C0 to U+1E2FF block (immediately after the Toto block, you might notice).

Then there's the Ho language, spoken by over a million people in India. Wikipedia cites a 2001 census as counting 2.2 million speakers, and a 2011 census as counting 1.4 million speakers. I very much doubt that both of those are accurate (you don't lose half a million people from an ethnic group in just ten years without some kind of war or genocide, and the Wikipedia article would have at least mentioned that if such a thing had happened), but to be safe, let's go with the lower estimate and say that at least 1.4 million people speak Ho. It can be written with the Latin alphabet, but its own alphabet is Warang Chiti (sometimes spelled Warang Citi), which was added to Unicode in 2014 and assigned the U+118A0 to U+118FF block.

And then there's the Adlam script for writing Fulani (also called Fulfulde), the language of the Fula people of western Africa. Fulani is spoken natively by 37 million people, and as a second language by another 2.7 million. Adlam's Unicode block is U+1E900 to U+1E95F.

So if you restrict your program to only working with the basic multilingual plane, it's not just emoji you'll be leaving out. It's also modern languages, spoken by anywhere from 1000 people to 37 million. How many speakers of a language are enough to draw the line and say "No, I won't ever translate my software into your language"?
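To make that concrete, here's a minimal Python sketch of what a BMP-only filter does to text in these scripts. The block ranges are the ones quoted above; the names `SCRIPT_BLOCKS`, `scripts_outside_bmp` and `strip_to_bmp` are made up for illustration and aren't from any library.

```python
# Block ranges as quoted in this comment (inclusive code point bounds).
SCRIPT_BLOCKS = {
    "Osage": (0x104B0, 0x104FF),
    "Warang Citi": (0x118A0, 0x118FF),
    "Toto": (0x1E290, 0x1E2BF),
    "Wancho": (0x1E2C0, 0x1E2FF),
    "Adlam": (0x1E900, 0x1E95F),
}

BMP_MAX = 0xFFFF  # last code point of the basic multilingual plane


def scripts_outside_bmp(text):
    """Return the names of the scripts listed above that occur in text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if cp <= BMP_MAX:
            continue  # BMP characters are not what we are looking for
        for name, (lo, hi) in SCRIPT_BLOCKS.items():
            if lo <= cp <= hi:
                found.add(name)
    return found


def strip_to_bmp(text):
    """Mimic a BMP-only filter: drop every character above U+FFFF."""
    return "".join(ch for ch in text if ord(ch) <= BMP_MAX)


# One Adlam letter and one Osage letter appended to plain ASCII.
sample = "hello " + chr(0x1E900) + chr(0x104B0)
print(scripts_outside_bmp(sample))
print(repr(strip_to_bmp(sample)))
```

The filter that was meant to drop emoji silently drops the Adlam and Osage letters too, which is the whole problem.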

Now, if your software is only targeting one language and you never intend to translate it, then yes, you'll only lose out on emoji if you stick to the U+0000 to U+FFFF range of the basic multilingual plane.

But realize that the higher planes are not just for dead languages. Living languages have ended up there too, and there are likely to be more in the future. It's quite possible that right now, someone somewhere is saying "Hey, why doesn't my language have its own alphabet instead of using Latin characters to write it? The Latin characters don't express the sounds of my language very well." And when they do get that alphabet worked out and manage to get it accepted into Unicode, it'll almost certainly land in one of the higher planes. Most likely that will be the U+10000 to U+1FFFF plane, which is far from full, but who knows. If you want to be able to handle every language spoken (and written) in the world today, you must be able to accept the full range of Unicode, not just the 16-bit range.
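As a rough illustration of what "the full range" costs in practice, this Python sketch compares how a BMP character and an Adlam character are encoded. `encoded_sizes` is a hypothetical helper of mine; the only real calls are the standard `str.encode` and `ord`.

```python
def encoded_sizes(ch):
    """Report how many UTF-8 bytes and UTF-16 code units one character needs."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        # utf-16-le emits no byte order mark, so bytes // 2 is the unit count.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


# "A" is in the BMP; U+1E900 is the first code point of the Adlam block.
for ch in ("A", chr(0x1E900)):
    print(encoded_sizes(ch))
```

Anything above U+FFFF takes four bytes in UTF-8 and a surrogate pair (two 16-bit units) in UTF-16, so storage or string handling that assumes at most three UTF-8 bytes, or one 16-bit unit per character, will mangle these scripts along with the emoji.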

> You don't lose half a million people from an ethnic group in just ten years without some kind of war or genocide.

Nothing happened to the people; their numbers are growing year on year. But languages can die very easily if governments don't put effort into teaching them to children. That is exactly what happened to the Ho language. There is no advantage in learning these small regional languages, so children put their effort into more popular languages like Hindi, Odia and English.

Here is a good article on this topic:

https://www.vogue.in/content/when-languages-in-india-disappe...

  • I'm familiar with the phenomenon, as my wife is a linguist who did her master's thesis on the phonology of a small language spoken by about 7000 people: many of the kids don't want to learn it, and just want to learn the majority language of the country since that's what they have to use in school. But I didn't think that could be the explanation for a 25% decline in ten years: new people may not be learning the language, but the only way people stop speaking their mother tongue is if they immigrate to a new country and fully adapt to it (happens to a few people, usually those who immigrated as children) or if they die (by far the most common reason for language-use decline: the old people are dying and the young people aren't learning it). If the decline were a couple hundred thousand, that would be the outside limit of probability, as far as I know.

    More likely, in my opinion, is that both are happening: yes, the language is declining, but either the earlier census overcounted speakers (e.g. counting children as speaking it when they weren't actually learning it) or else the later census undercounted speakers; either way the language decline would look larger than it actually is. Given that Ethnologue (https://www.ethnologue.com/language/hoc/) rates the language vitality as "Stable" — "The language is not being sustained by formal institutions, but it is still the norm in the home and community that all children learn and use the language" — and they usually know what they're talking about, I suspect the language decline isn't that fast and a census counting mistake is a more likely explanation for the discrepancy over ten years.