Comment by jampekka
19 days ago
What do you mean by relying on Google?
The largest Llama 3.1 and DeepSeek v3/R1 models are rather good even at a niche language like Finnish. Performance does plummet in the smaller versions, and even quantization may harm multilinguality disproportionately.
Something like deliberately distilling specific languages out of the largest models could work well (rough sketch below). Starting from scratch with a "legal" dataset will most likely fail, as you say.
Silo AI (co-lead of this model) already tried Finnish and Scandinavian/Nordic models with the from-scratch strategy, and the results are not too encouraging.
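To be concrete about the distillation idea: here's a minimal sketch of harvesting language-specific synthetic text from a large open teacher via the Hugging Face transformers pipeline. The teacher model choice, output file, and the two Finnish seed prompts are all placeholders; in practice you'd want thousands of prompts across varied domains.

```python
import json
from transformers import pipeline

# Hypothetical teacher: the largest open model you can actually run.
teacher = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",
    device_map="auto",
)

# Placeholder Finnish seed prompts.
seed_prompts = [
    "Kirjoita lyhyt uutisartikkeli säästä Helsingissä.",
    "Selitä fotosynteesi yksinkertaisesti suomeksi.",
]

with open("finnish_distill.jsonl", "w", encoding="utf-8") as f:
    for prompt in seed_prompts:
        out = teacher(
            prompt,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.8,
            return_full_text=False,  # keep only the completion
        )
        # Each prompt/completion pair becomes training data for a small student.
        record = {"prompt": prompt, "completion": out[0]["generated_text"]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```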
Yes, I think small languages with a total corpus of maybe a few hundred million tokens have no chance of producing a coherent model without synthetic data. And using synthetic data from existing models trained on all public (and less public) data is enough of a legal gray area that I wouldn't expect this project to consider it, so it's doomed before it even starts.
Something like 4o is so strong in most languages that one could just generate an effectively infinite dataset from it and be done with it. I'm not sure how OAI managed it, tbh.
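For illustration, a minimal sketch of what that "infinite dataset" bootstrapping might look like with the OpenAI Python client. The seed topics, prompt wording, and output file are placeholders, and the legal caveats above still apply:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical seed topics; in practice you'd want a large, diverse pool.
topics = ["sää", "ruoanlaitto", "jalkapallo", "paikallishistoria"]

with open("synthetic_fi.jsonl", "a", encoding="utf-8") as f:
    for topic in topics:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Kirjoita muutama kappale sujuvaa suomea aiheesta: {topic}",
            }],
        )
        # Store generated text for later filtering and student training.
        record = {"topic": topic, "text": resp.choices[0].message.content}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```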