Comment by jononor

3 days ago

These models will never compete with frontier models and do not need to - it is about being good enough, not being the best. Behind the frontier, reaching a given performance level is getting easier over time - both sample efficiency and compute efficiency are going up.

Furthermore, one can reuse investments in data (agreements, infrastructure and datasets), compute (GPUs, servers) and know-how (training scripts, experienced engineers).

But are you seriously under the belief that all of that, plus all the other things you're forgetting about, is easier, cheaper and faster than transcriptions and translations?

I understand and agree that building the LLMs yourself comes with more benefits, long-term ones especially, but it's still harder, more expensive and really time-consuming work.

  • I do not know which is easier. I am not sure it is even well established in research whether, for generative text tasks, a translation-first or a native-language-first approach is more sample efficient.

    But for a national lab I think it is money well spent to figure out the possibilities and limitations of native-language LLMs for languages with on the order of 5M-10M speakers.