Comment by internet_points

9 days ago

> However sceptical of "AI" you are, "give me the information on this page in my preferred language" is the kind of task they excel at.

Except for the 90% or more of the world's 7000-ish languages which have barely any data online.

E.g. the huge CommonCrawl corpus has stats https://commoncrawl.github.io/cc-crawl-statistics/plots/lang... for only 160 languages. English takes up nearly half the corpus. After the top 16 or so, every language has <1% of the corpus, and over half of those 160 have <0.1%. The other 6000+ languages all fall into the <unknown> category. The long tail is very long.
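
If you want to check the tail yourself, here's a rough sketch of the bucketing. The shares in it are made-up placeholder numbers, not the real CommonCrawl figures, and `bucket_shares` is just a name I picked. You'd swap in the percentages from the stats page.

```python
def bucket_shares(shares, thresholds=(1.0, 0.1)):
    """Count how many languages have a corpus share below each threshold (in %)."""
    return {t: sum(1 for pct in shares.values() if pct < t) for t in thresholds}


# Placeholder percentages for illustration only; NOT real CommonCrawl data.
example_shares = {"aa": 45.0, "bb": 6.0, "cc": 0.8, "dd": 0.05, "ee": 0.02}

counts = bucket_shares(example_shares)
print(counts)
```

With real numbers, the count under the 1% threshold is what shows how quickly the distribution falls off after the first handful of languages.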

(You'll see people use the term "low-resource language" and then talk about Finnish or Macedonian – if you're not a linguist and you've heard of the language, it's most likely not low-resource ;-))