Comment by jononor

3 days ago

These models will never compete with frontier models and do not need to - it is about hitting good-enough, not being the best. Behind the frontier, reaching a given performance level is getting easier over time - both sample efficiency and compute efficiency are going up.

Furthermore, one can reuse investments in data (agreements, infrastructure and datasets), compute (GPUs, servers) and know-how (training scripts, experienced engineers).

2 comments

embedding-shape 3 days ago

But do you seriously believe that all of that, plus all the other things you're forgetting about, is easier, cheaper and faster than transcriptions and translations?

I understand and agree that building the LLMs yourself comes with more benefits, long-term ones especially, but it's still harder, more expensive and really time-consuming work.

jononor 3 days ago

I do not know which is easier. I am not sure it is even well established in research whether a translation-first or a native-language-first approach is more sample efficient for generative text tasks.
But for a national lab I think it is money well spent to figure out the possibilities and limitations of native-language LLMs for languages with on the order of 5M-10M speakers.