Omnilingual ASR: Advancing automatic speech recognition for 1600 languages

11 hours ago (ai.meta.com)

HF Demo: https://huggingface.co/spaces/facebook/omniasr-transcription...

GitHub: https://github.com/facebookresearch/omnilingual-asr

First, let me say that this is impressive. And then let me pose some questions:

As a linguist, I would like to know more about the kinds of languages this works well with, or does not work well with. For example, half the world's languages are tone languages, and the way tones work varies greatly among these. Some just have high and low tones, while others are considerably more complicated; Thai has high, mid, low, rising and falling. Also, tone is relative, e.g. a man's high tone might be a woman's low tone. And some African languages have tones whose absolute frequencies vary across an utterance. So transcribing tone is a quite different problem from transcribing phonemes--and yet for many tone languages, the tone is crucial.

There are also rare(r) phonemes, like the clicks in many languages of southern Africa. Of course maybe they've already trained on some of these languages.

The HuggingFace demo says "Supported Languages[:] For this public demo, we’ve restricted transcription to low-resource languages with error rates below 10%." That's unclear: 10% word error rate, or character/phoneme error rate? The meta.com page refers to character error rate (CER); a 10% character error rate can imply a much higher word error rate (WER), since most words contain several characters/phonemes. That said, there are ways to get around that, like using a dictionary to select among different paths through possible character sequences so you only get known words, and adding to that a morphological parser for languages that have lots of affixes (meaning not all the word forms will be in the dictionary--think walk, walks, walked, walking--only the first will be in most dictionaries).
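
To make the CER-vs-WER gap concrete, here's a toy calculation (made-up strings, nothing to do with this model): two wrong characters in a six-word sentence already give roughly 9% CER but 33% WER.

    def edit_distance(ref, hyp):
        # standard Levenshtein distance; works on strings (chars) or lists (words)
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                       d[j - 1] + 1,     # insertion
                                       prev + (r != h))  # substitution
        return d[-1]

    ref = "the cat sat on the mat"
    hyp = "the cat sad on the mit"   # two single-character substitutions

    cer = edit_distance(ref, hyp) / len(ref)                          # ~9%  (2/22)
    wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())  # ~33% (2/6)
    print(f"CER: {cer:.0%}, WER: {wer:.0%}")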

Enquiring minds want to know!

This seems like a massive improvement for openly available local ASR. Even the 300M model outperforms whisper-large-v3 according to the paper's benchmarks.

  • Not sure; I recorded 3 seconds of voice (a single sentence) and the HF demo misrecognized about half of the words.

    • And moreover, you cannot tune those models for practical applications. The model was originally trained on very clean data, so the lower layers are also not very stable for diverse inputs. To fine-tune, you have to update the whole model, not just the upper layers.
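
      Roughly, in generic PyTorch terms (a toy stand-in, not the actual omnilingual-asr code):

        import torch
        import torch.nn as nn

        # hypothetical stand-in for an ASR model: lower acoustic layers + upper layers + head
        model = nn.Sequential(
            nn.Linear(80, 512), nn.ReLU(),    # "lower layers" (acoustic front-end)
            nn.Linear(512, 512), nn.ReLU(),   # "upper layers"
            nn.Linear(512, 100),              # output head (e.g. character logits)
        )

        # Option A: freeze the lower layers and only train the rest
        for p in model[0].parameters():
            p.requires_grad = False

        # Option B (the point above): if the lower layers only ever saw clean
        # audio, they need to adapt too, so keep the whole model trainable
        for p in model.parameters():
            p.requires_grad = True

        optimizer = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=1e-5
        )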

    • This model is actually expected to be bad for popular languages. Just like the previous MMS, it is not accurate at all; it wins by supporting something rare well, but it never had good ASR accuracy even for Swedish etc. It is more a research thing than a real tool, unlike Whisper.

Only a few GB of weights will recognize speech in 1600+ languages.

Freely downloadable and usable by anyone for almost anything.

We truly live in the future.

  • Seeing the absurd number of languages made me think of the Norm Macdonald joke:

    Music is the universal language, but one day soon it will be replaced by Chinese.

Does anyone else feel like they buried the lede?

> Omnilingual ASR was designed as a community-driven framework. People around the world can extend Omnilingual ASR to new languages by using just a few of their own samples.

The world just got smaller

Just killed my startup. https://6k.ai

Half joking - hopefully we can still contribute something to this field. Looking forward to doing some tests with this.

How hard is it to make TTS out of this? A few independent journalists from Belarus asked for TTS in their language, but I am no expert; I was thinking about re-using Mozilla's work. What's the easiest way to get working TTS for a language?

  • EDIT: My bad, please disregard; as akreal pointed out, the MMS TTS models aren’t using the SSL models.

    Original post:

    You can use the OmniASR SSL models instead of their older MMS models to create TTS models: https://github.com/ylacombe/finetune-hf-vits

    • As far as I understand, the MMS TTS models are trained from scratch (section 7.1 of [1]); they do not employ any SSL models. So the OmniASR SSL models are not useful here.

      What might be interesting is the newly released OmniASR data, because the MMS data, which was used for the MMS TTS, was never released.

      Also, the OmniASR can be used to transcribe some untranscribed speech to train a TTS on it.
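
      For the original question (quickest way to working TTS), the existing MMS TTS checkpoints can already be run through transformers. A minimal sketch, assuming a checkpoint for the target language exists on the Hub (Belarusian should be facebook/mms-tts-bel -- worth verifying):

        import torch
        import scipy.io.wavfile
        from transformers import VitsModel, AutoTokenizer

        # MMS TTS checkpoints are named facebook/mms-tts-<ISO 639-3 code>
        model = VitsModel.from_pretrained("facebook/mms-tts-bel")
        tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-bel")

        # note: some non-Latin-script checkpoints expect uroman-romanized input;
        # check the model card if the output sounds wrong
        inputs = tokenizer("Прывітанне, свет", return_tensors="pt")
        with torch.no_grad():
            waveform = model(**inputs).waveform[0]

        scipy.io.wavfile.write("out.wav", rate=model.config.sampling_rate,
                               data=waveform.numpy())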

      [1] MMS paper: https://arxiv.org/pdf/2305.13516


  • TFA says that it’s extremely easy to add new languages with just a few examples. I didn’t see specifics on how “few” it really is, though.