Comment by lostmsu

8 hours ago

Not sure. I recorded 3 seconds of speech (a single sentence) and the HF demo misrecognized about half of the words.

Moreover, you cannot tune these models for practical applications. The model was originally trained on very clean data, so the lower layers are not very stable on diverse inputs either. To fine-tune, you have to update the whole model, not just the upper layers.

This model is actually expected to be bad for popular languages. Just like the previous MMS, it is not accurate at all: it wins by supporting rare languages well, but it never had good ASR accuracy even for languages like Swedish. It is more of a research thing than a real tool, unlike Whisper.