Comment by akolbe

3 years ago

Being freely licensed, the Wikipedia corpus is widely used as input for AI tools such as machine translation.

https://www.theregister.com/2020/08/26/scots_wikipedia_fake/

How does that hurt the language itself? What would it mean for the language to be hurt?

  • Any people wanting to learn the language using these resources (e.g. learning more Scots from the Scots wikipedia) would mean that they learn the altered/false/different language instead of the actual language, and afterward that changes how the actual language is used in practice, propagating the misconceptions onwards and degrading the language.

    In a similar way, if some people online are using they/their/they're interchangeably, and then other people (who are learning the language) learn that they are interchangeable and start using them this way, then English gets hurt by being altered in this undesired way.

    Some changes to language are considered desirable (e.g. introduction of new terminology for new concepts, or restructuring either as a natural process or top-down reforms that makes it more clear and thus more useful for communication) and some are not (ones that increase confusion such as the they/their/they're example above), and the latter ones are considered to hurt the language.

    • > Some changes to language are considered desirable (e.g. introduction of new terminology for new concepts, or restructuring either as a natural process

      The problem is that what you describe in your first paragraph:

      > Any people wanting to learn the language using these resources (e.g. learning more Scots from the Scots wikipedia) would mean that they learn the altered/false/different language instead of the actual language, and afterward that changes how the actual language is used in practice

      is just a description of how languages change as a natural process. It's not different in any way. Concluding that in this case it is "damage" and "bad" would require you to conclude that all natural language change is also "damage" and "bad", which is admittedly a popular viewpoint. But it's one you're trying to disavow.

      3 replies →

  • AI tools and machine translation would result in the propagation on the internet of new prose, purportedly in Scots, but in fact synthesised based on the constructed language of scowiki.

    Real prose in Scots is scarce enough already; if scowiki is used as a corpus for training AI, that is certainly a threat to the language.