Comment by miros_love
5 days ago
Take Slovene, for example. There simply isn't enough data for it on its own, but if you add all the data that is available for related languages, you get higher quality. Current LLMs fall short on exactly this property for low-resource languages.
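(A minimal sketch of what that kind of related-language data mixing could look like, using the Hugging Face `datasets` library. The CC-100 corpora, the choice of Croatian and Serbian as the related languages, and the mixing ratios are all illustrative assumptions, not anything specified in the thread.)

```python
from datasets import load_dataset, interleave_datasets

# Illustrative only: top up a Slovene corpus with related South Slavic
# languages. Dataset choice and ratios are assumptions, not tuned values.
slovene = load_dataset("cc100", lang="sl", split="train", streaming=True)
croatian = load_dataset("cc100", lang="hr", split="train", streaming=True)
serbian = load_dataset("cc100", lang="sr", split="train", streaming=True)

# Keep Slovene dominant so the model does not drift toward the
# higher-resource relatives; sample the rest in at lower probability.
mixed = interleave_datasets(
    [slovene, croatian, serbian],
    probabilities=[0.6, 0.25, 0.15],
    seed=42,
)

for example in mixed.take(3):  # peek at the mixed stream
    print(example["text"][:80])
```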
I'm not sure I'm convinced. I speak a small European language, and the general experience is that LLMs are often wrong precisely because they think they can just borrow from a related language. The result is even worse and often makes no sense whatsoever. In other words, as far as translation goes, confidently incorrect is not useful.
They train on 14 billion tokens in Slovene. Are you sure that's not enough?
Unfortunately, yes.
We need more tokens, a wider variety of topics in the texts, and more complexity.
We need one-shot learning, that is, models that can pick things up from a single example rather than billions of tokens.
(That amount is equivalent to roughly 50,000 books, more than almost any native speaker will ever have read.)
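(A quick back-of-envelope check on that book equivalence. The words-per-book and tokens-per-word figures below are assumptions for illustration, not numbers from the thread.)

```python
# Rough sanity check on "14B tokens ~ 50,000 books".
tokens_total = 14_000_000_000

# Tokens per book implied by the 50,000-book figure:
implied = tokens_total / 50_000            # 280,000 tokens per book

# Alternative estimate, assuming ~90,000 words per book and ~1.4
# subword tokens per word (both assumed, not from the thread):
tokens_per_book = 90_000 * 1.4             # 126,000 tokens per book
books = tokens_total / tokens_per_book     # ~111,000 books

print(f"implied tokens/book: {implied:,.0f}")
print(f"alternative estimate: {books:,.0f} books")
```

Either way it comes out to tens of thousands of books, far more text than any one reader ever gets through, which is the point of the comparison.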