Comment by bawolff

9 days ago

> As a result, WMF prioritizes investing in emerging markets over enwiki. This means outreach to indigenous languages in the Global South and developing supporting infrastructure. e.g. "Abstract Wikipedia" which aims to use a language-neutral syntax that can be automatically translated into any language.

I'd disagree that there is a causal relationship here. I think most of the outreach to indigenous languages has more to do with politics and ideology than anything else (Wikimedia sees itself as a global movement to collect all knowledge. It can't exactly claim that if it's all English).

As for Abstract Wikipedia, I think that is more a moonshot project driven by people wanting to make the next Wikidata. I suspect a major part of the support for it is that they can use alternative sources of funding for it (grants).

The "abstract Wikipedia" just seems like a solved problem with LLMs.

However sceptical of "AI" you are, "give me the information on this page in my preferred language" is the kind of task they excel at. (I won't use the word translate.) It wouldn't even require prioritising the English Wikipedia: any agent today could one-shot a task like "check the Wikipedia pages in all languages for X, summarize the results and note any disagreements between them".

  • Abstract Wikipedia is taking a symbolic AI approach instead of an LLM or other statistical approach. The hope is (as I understand it) that this will provide reliability and predictability, and extend better to languages that don't have a large corpus of text to train things on.

    Personally I think it's a bit of a wild bet, one that seems especially surprising in the modern context. Guess we'll have to see if it pans out.

    • I'd kind of expect that they do better with translation if the language on either end is English, given the amount of input they get in it compared to this abstract language. Even in the world of human translation, it is not uncommon to translate into English as a "pivot language" and then do every other translation from the English version rather than from the original text.

  • > However sceptical of "AI" you are, "give me the information on this page in my preferred language" is the kind of task they excel at.

    Except for the 90% or more of the world's 7000-ish languages which have barely any data online.

    E.g. the huge CommonCrawl corpus has stats https://commoncrawl.github.io/cc-crawl-statistics/plots/lang... for only 160 languages. English takes up nearly half the corpus. After the top 16 or so, every language has <1% of the corpus, and over half of those 160 have <0.1%. The other 6000+ languages are distributed amongst the <unknown> category. The long tail is very long.

    (You'll see people use the term "low-resource language" and then talk about Finnish or Macedonian – if you're not a linguist and you've heard of the language, it's most likely not low-resource ;-))

  • > give me the information on this page in my preferred language

    I'm sure that works great for European languages and other languages with huge corpora. Those are not the target languages of the program in question.

    • LLMs are great with minority languages compared to almost anything else, including the natural language generation that Abstract Wikipedia employs, which whiffs at relatively large languages like Zulu and Xhosa, let alone many of the rarer languages that popular LLMs speak fluently.

  • It's not a good idea for common languages like German or English or French.

    But it is a great idea for indigenous languages that many people speak but that aren't in the training data, which was the original purpose.

    I am hopeful that it'll create synthetic training data for those groups.

  • > "give me the information on this page in my preferred language" is the kind of task they excel at.

    ...So long as you don't mind it introducing random hallucinations into the information.

    • Biased by its training data to boot, the opposite of what you’d want in an encyclopedia.