
Comment by thaumasiotes

3 years ago

> and damage to the language itself!

What could this mean? Sometimes people do talk about the health of a language, but that almost exclusively refers to the number of living speakers.

Being freely licensed, the Wikipedia corpus is widely used as input for AI tools such as machine translation.

https://www.theregister.com/2020/08/26/scots_wikipedia_fake/

  • How does that hurt the language itself? What would it mean for the language to be hurt?

    • Anyone wanting to learn the language from these resources (e.g. learning more Scots from the Scots Wikipedia) would learn the altered/false/different language instead of the actual language. That in turn changes how the actual language is used in practice, propagating the misconceptions onward and degrading the language.

      In a similar way, if some people online are using they/their/they're interchangeably, and then other people (who are learning the language) learn that they are interchangeable and start using them this way, then English gets hurt by being altered in this undesired way.

      Some changes to a language are considered desirable (e.g. the introduction of new terminology for new concepts, or restructuring, whether through a natural process or top-down reforms, that makes it clearer and thus more useful for communication) and some are not (ones that increase confusion, such as the they/their/they're example above). The latter are considered to hurt the language.


    • AI tools and machine translation would result in the propagation on the internet of new prose, purportedly in Scots, but in fact synthesised based on the constructed language of scowiki.

      Real prose in Scots is scarce enough already; if scowiki is used as a corpus for training AI, that is certainly a threat to the language.