Comment by albertzeyer

6 days ago

That's not quite true. The state of the art in both speech recognition and translation is still a dedicated model trained only for that one task. The gap is getting smaller and smaller, though, and it also depends heavily on who invests how much training budget.

For example, for automatic speech recognition (ASR), see: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

The current best ASR model has 600M params (tiny compared to LLMs, and far faster than any LLM: 3386.02 RTFx vs. 62.12 RTFx, so much cheaper) and was trained on 120,000h of speech. In comparison, the next-best speech LLM (quite close in WER, but slightly worse) has 5.6B params and was trained on 5T tokens, including 2.3M hours of speech. It has always been like this: for a fraction of the cost, you get a pure ASR model that still beats every speech LLM.
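To make the throughput gap concrete: RTFx is the inverse real-time factor, i.e. audio duration divided by processing time, so the two leaderboard figures above imply roughly a 54x speed difference. A minimal sketch (the two RTFx numbers are from the leaderboard; the helper function is just illustrative):

```python
# RTFx (inverse real-time factor): seconds of audio processed
# per second of compute. Higher is faster/cheaper.
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

dedicated_asr = 3386.02  # leaderboard RTFx for the 600M-param ASR model
speech_llm = 62.12       # leaderboard RTFx for the 5.6B-param speech LLM

# An RTFx of 3386 means one hour of audio takes ~1.06 s of compute.
hour = 3600.0
print(f"{hour / dedicated_asr:.2f} s per hour of audio")  # ≈ 1.06 s

speedup = dedicated_asr / speech_llm
print(f"{speedup:.1f}x faster")  # ≈ 54.5x
```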

The same is true for translation models, at least when you have enough training data, i.e. for popular language pairs.

However, LLMs are obviously more powerful in what they can do beyond just speech recognition or translation.

What translation models are better than LLMs?

The problem with Google-Translate-type models is that the interface is completely wrong. Translation is not sentence->translation; it's (sentence,context)->translation (or even (sentence,context)->(translation,commentary)). You absolutely have to be able to input contextual information, instructions about how certain terms are to be translated, etc. This is trivial with an LLM.
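The (sentence,context)->translation interface is trivial to express as an LLM prompt. A sketch under stated assumptions: the prompt-building helper, the example glossary, and the sentence are all made up for illustration, and any chat-style LLM API would consume the resulting string.

```python
# Hypothetical helper: turn (sentence, context, glossary) into a single
# translation prompt for a chat-style LLM. Nothing here is a real API.
def build_prompt(sentence: str, context: str, glossary: dict[str, str]) -> str:
    lines = [
        "Translate the sentence below into English.",
        f"Document context: {context}",
        "Required terminology:",
    ]
    lines += [f"- translate '{src}' as '{dst}'" for src, dst in glossary.items()]
    lines += [f"Sentence: {sentence}", "Return only the translation."]
    return "\n".join(lines)

prompt = build_prompt(
    sentence="Der Treiber stürzte ab.",
    context="A bug report about a GPU kernel module.",
    glossary={"Treiber": "driver"},  # not "coachman", as out of context
)
print(prompt)
```

The point is that context and terminology constraints ride along in plain text, which a sentence-in/sentence-out interface like the Google Translate box simply has no slot for.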

  • This is true, and LLMs crush Google in many translation tasks, but they do too many other things. They can and do go off script, especially if they "object" to the content being translated.

    "As a safe AI language model, I refuse to translate this" is not a valid translation of "spierdalaj".

    • The converse, however, is a different story. "Spierdalaj" is quite a good translation of "As a safe AI language model, I refuse to translate this."

    • One would have to be absolutely cooked to consider using a censored model to translate or talk about anything a preschooler's ears can't hear.

      There are plenty of uncensored models that will run on less than 8GB of vram.

Haha, that word. Back in the '80s, some Polish friends of mine taught it to me but refused to tell me what it meant, and instructed me to never, ever use it. To this day I don't know what it's about...

  • I've been using small local LLMs for translation recently (<=7GB total vram usage) and they, even the small ones, definitely beat Google Translate in my experience. And they don't require sharing whatever I'm reading with Google, which is nice.

I'm not sure what type of model Google uses nowadays for their web interface. I know that they also provide LLM-based translation via their API.

Traditional cross-attention-based encoder-decoder translation models also support document-level translation, including context. And Google definitely has all those models. But I think the Google web interface has used much weaker models (for whatever reason; maybe inference costs?).

I think DeepL is quite good. For business applications, there are Lilt, AppTek, and many others. They can easily set up a model for you that allows you to specify context, or train one for a specific domain, e.g. medical texts.

I don't really have a good reference for a similar leaderboard for translation models. For translation, measuring quality is in any case much more problematic than for speech recognition; I think for the best models, only human evaluation works well at this point.

It's not the speech recognition model alone that's fantastic. It's coupling it to an LLM for cleanup that makes all the difference.

See https://blog.nawaz.org/posts/2023/Dec/cleaning-up-speech-rec...

(This is not the best example, as I gave it free rein to modify the text - I should post a follow-up with an example closer to a typical use of speech recognition.)

Without that extra cleanup, Whisper is simply not good enough.
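The coupling described above can be sketched in a few lines. This is a sketch under stated assumptions: `transcribe` stands in for a real ASR model such as Whisper, `llm` for any chat-style LLM call, and the prompt wording is hypothetical; the key design choice is constraining the LLM to correction only, to limit its tendency to go off script.

```python
# Hypothetical ASR -> LLM cleanup pipeline. The prompt asks only for
# error correction, not rewriting, to keep the LLM from "improving" content.
CLEANUP_PROMPT = (
    "Below is a raw speech-recognition transcript. Fix recognition errors, "
    "punctuation, and casing, but do not add, remove, or rephrase content:\n\n{t}"
)

def clean_transcript(raw: str, llm) -> str:
    """Pass a raw ASR transcript through an LLM for cleanup."""
    return llm(CLEANUP_PROMPT.format(t=raw))

# Usage with a stub LLM (a real one would fix actual recognition errors):
def stub_llm(prompt: str) -> str:
    transcript = prompt.rsplit("\n\n", 1)[-1]
    return transcript.capitalize() + "."

print(clean_transcript("the quick brown focks", stub_llm))
```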

> However, LLMs are obviously more powerful in what they can do beyond just speech recognition

Unfortunately, one of those powerful features is "make up new things that fit well but nobody actually said", and... well, there's no way to disable it. :p

That leaderboard omits the current SOTA, which is GPT-4o-transcribe (an LLM).