Comment by moffkalast
19 days ago
Yes, I think small languages with a total corpus of maybe a few hundred million tokens have no chance of producing a coherent model without synthetic data. And using synthetic data from existing models trained on all public (and less public) data is enough of a legal gray area that I wouldn't expect this project to consider it, so it's doomed before it even starts.
Something like 4o is so fluent in most languages that one could just generate an effectively infinite dataset from it and be done with it. I'm not sure how OAI managed that, tbh.
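For illustration, here's a minimal sketch of what that kind of synthetic-data generation could look like, assuming the OpenAI Python SDK and an API key in the environment. The seed topics, the prompt wording, and the choice of Welsh as the target language are all hypothetical assumptions for the example; only the model name (gpt-4o) comes from the comment above.

```python
# Minimal sketch: generating synthetic text in a low-resource language
# by sampling from a stronger multilingual model. Assumes the OpenAI
# Python SDK (`pip install openai`) and OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()

TARGET_LANGUAGE = "Welsh"  # hypothetical example of a target language
seed_topics = ["weather", "cooking", "local history"]  # hypothetical seeds


def generate_samples(topic: str, n: int = 3) -> list[str]:
    """Request several short passages on a topic in the target language."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        n=n,               # multiple completions per request
        temperature=1.0,   # higher temperature for corpus diversity
        messages=[
            {
                "role": "system",
                "content": f"You write short, natural paragraphs in {TARGET_LANGUAGE}.",
            },
            {"role": "user", "content": f"Write a paragraph about: {topic}"},
        ],
    )
    return [choice.message.content for choice in resp.choices]


if __name__ == "__main__":
    corpus = [text for t in seed_topics for text in generate_samples(t)]
    print(f"collected {len(corpus)} synthetic passages")
```

In practice one would loop this over many more seed topics and dedupe/filter the output, but the basic recipe really is this simple, which is the commenter's point.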