Comment by retrac
6 days ago
Machine translation and speech recognition. The state of the art for these is a multi-modal language model. I'm hearing impaired, verging on deaf, and I use this technology all day, every day. I wanted to watch an old TV series from the 1980s. There are no subtitles available. So I fed the show into a language model (Whisper) and now I have passable subtitles that allow me to watch the show.
Am I the only one who remembers when that was the stuff of science fiction? Not so long ago it was an open question whether machines would ever be able to transcribe speech in a useful way. How quickly we become numb to the magic.
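If anyone wants to try the same thing, a rough sketch with the open-source openai-whisper package looks something like this (the file names are placeholders, ffmpeg needs to be installed, and results depend a lot on the audio quality):

    # Rough sketch, assuming the open-source openai-whisper package and ffmpeg;
    # file names are placeholders, and bigger models trade speed for accuracy.
    import whisper

    def srt_timestamp(seconds: float) -> str:
        # SRT wants HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    model = whisper.load_model("base")          # "medium"/"large" are slower but more accurate
    result = model.transcribe("episode01.mkv")  # ffmpeg extracts the audio track from the video

    with open("episode01.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")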
That's not quite true. The state of the art in both speech recognition and translation is still a dedicated model trained for that task alone. The gap is getting smaller and smaller, though, and it also depends heavily on who invests how much training budget.
For example, for automatic speech recognition (ASR), see: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
The current best ASR model has 600M params (tiny compared to LLMs, way faster than any LLM: 3386.02 RTFx vs 62.12 RTFx, and much cheaper) and was trained on 120,000h of speech. In comparison, the next best speech LLM (quite close in WER, but slightly worse) has 5.6B params and was trained on 5T tokens, 2.3M speech hours. It has always been like this: for a fraction of the cost, you get a pure ASR model that still beats every speech LLM.
The same is true for translation models, at least when you have enough training data, i.e. for popular language pairs.
However, LLMs are obviously more powerful in what they can do beyond just speech recognition or translation.
What translation models are better than LLMs?
The problem with Google-Translate-type models is that the interface is completely wrong. Translation is not sentence->translation, it's (sentence,context)->translation (or even (sentence,context)->(translation,commentary)). You absolutely have to be able to input contextual information, instructions about how certain terms are to be translated, etc. This is trivial with an LLM.
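As a rough sketch of what that (sentence,context)->translation interface can look like with any chat-style LLM API (the OpenAI Python SDK here; the model name, glossary, and prompt wording are purely illustrative):

    # Rough sketch: (sentence, context, glossary) -> translation via a chat-style LLM.
    # The model name, glossary, and prompt wording are illustrative, not a recommendation.
    from openai import OpenAI

    client = OpenAI()

    def translate(sentence: str, context: str, glossary: dict[str, str]) -> str:
        terms = "\n".join(f'- render "{src}" as "{dst}"' for src, dst in glossary.items())
        prompt = (
            "Translate the sentence below into English.\n"
            f"Context for the document it comes from: {context}\n"
            f"Terminology constraints:\n{terms}\n"
            "Return only the translation, with no commentary.\n\n"
            f"Sentence: {sentence}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip()

    print(translate(
        "Der Zug ist schon wieder ausgefallen.",
        context="An informal blog post complaining about commuting in Berlin.",
        glossary={"Zug": "train"},
    ))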
This is true, and LLMs crush Google in many translation tasks, but they do too many other things. They can and do go off script, especially if they "object" to the content being translated.
"As a safe AI language model, I refuse to translate this" is not a valid translation of "spierdalaj".
I've been using small local LLMs for translation recently (<=7GB total vram usage) and they, even the small ones, definitely beat Google Translate in my experience. And they don't require sharing whatever I'm reading with Google, which is nice.
I'm not sure what type of model Google uses nowadays for their web interface. I know that they also provide LLM-based translation via their API.
Also, traditional cross-attention-based encoder-decoder translation models support document-level translation, including with context. And Google definitely has all those models. But I think the Google web interface has used much weaker models (for whatever reason; maybe inference costs?).
I think DeepL is quite good. For business applications, there are Lilt, AppTek, and many others. They can easily set up a model for you that allows you to specify context, or one trained for some specific domain, e.g. medical texts.
I don't really have a good reference for a similar leaderboard for translation models. For translation, measuring quality is in any case much more problematic than for speech recognition; I think for the best models, only human evaluation works well at this point.
It's not the speech recognition model alone that's fantastic. It's coupling it to an LLM for cleanup that makes all the difference.
See https://blog.nawaz.org/posts/2023/Dec/cleaning-up-speech-rec...
(This is not the best example as I gave it free rein to modify the text - I should post a followup that has an example closer to a typical use of speech recognition).
Without that extra cleanup, Whisper is simply not good enough.
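For the curious, the basic shape of that pipeline is roughly the following (a sketch only; the model names, file name, and cleanup prompt are placeholders rather than the exact setup from the post above):

    # Rough sketch of a Whisper -> LLM cleanup pipeline; names and prompt are placeholders.
    import whisper
    from openai import OpenAI

    client = OpenAI()

    # 1. Raw transcript from Whisper.
    raw = whisper.load_model("base").transcribe("dictation.wav")["text"]

    # 2. Have an LLM fix punctuation and obvious mis-hearings without rewriting the content.
    cleaned = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Clean up this raw speech-recognition transcript: fix punctuation, "
                "casing, and obvious mis-hearings, but do not add, remove, or reorder content.\n\n"
                + raw
            ),
        }],
    ).choices[0].message.content

    print(cleaned)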
> However, LLMs are obviously more powerful in what they can do beyond just speech recognition
Unfortunately, one of those powerful features is "make up new things that fit well but nobody actually said", and... well, there's no way to disable it. :p
That leaderboard omits the current SOTA, which is GPT-4o-transcribe (an LLM).
Do you have any comparisons in terms of WER? I doubt that GPT-4o-transcribe is better than the best models from that leaderboard (https://huggingface.co/spaces/hf-audio/open_asr_leaderboard). A quick search on this got me here: https://www.reddit.com/r/OpenAI/comments/1jvdqty/gpt4otransc... https://scribewave.com/blog/openai-launches-gpt-4o-transcrib...
It is stated that GPT-4o-transcribe is better than Whisper-large. That might be true, but which version of Whisper-large exactly? Looking at the leaderboard, there are a lot of Whisper variants. But anyway, the best Whisper variant, CrisperWhisper, is currently only at rank 5. (I assume GPT-4o-transcribe was not compared to that but to some other Whisper model.)
It is stated that Scribe v1 from ElevenLabs is better than GPT-4o-transcribe. In the leaderboard, Scribe v1 is also only at rank 6.
> Am I the only one who remembers when that was the stuff of science fiction?
Would you go to a foreign country and sign a work contract based on the LLM translation?
Would you answer a police procedure based on the speech recognition alone?
That to me was the promise of the science fiction. Going to another planet and doing inter-species negotiations based on machine translation. We're definitely not there IMHO, and I wouldn't be surprised if we don't quite get there in our lifetime.
Otherwise, if we're lowering the bar, speech-to-text has been here for decades, albeit clunky and power hungry. So improvements have been made, but watching old movies is way too low-stakes a situation IMHO.
This is very dismissive and binary, and that's what this whole article is about. AI skeptics expect either AGI, perfect across all use cases, or else it's useless. STT, translation, and TTS have come really far in the last 2 years. My mother, who doesn't speak English, finds it very useful when she visits my sister in the US. I find it super useful while travelling in Asia. Definitely much more useful than what we had in Google Translate.
I'd understand calling it dismissive if you didn't choose these as counterpoints:
- your mother visiting your sister (arguably extremely low stakes; at any moment she can just phone your sister, I presume?)
- you traveling around (you're not trying to close a business deal or do anything irreversible)
Basically you seem to be agreeing that it's fine for convenience, but not ready for "science fiction" level use cases.
We have the tools to do this, and will have commercial products for everything you listed in the next couple years.
> Machine translation and speech recognition.
Yes, yes and yes!
I tried speech recognition many times over the years (Dragon, etc). Initially they all were "Wow!", but they simply were not good enough to use. 95% accuracy is not good enough.
Now I record my voice, have Whisper transcribe it, and pass the transcript to an LLM for cleanup. The LLM contribution is what finally made this feasible.
It's not perfect. I still have to correct things, but only about a tenth as often as I used to. When I'm transcribing notes for myself, I'm at the point where I don't even bother verifying the output. Small errors are OK for my own notes.
Have they solved the problem of Whisper making up plausible sounding junk (e.g. such that reading it you would have no idea it was completely hallucinated) when there is any silence or pause in the audio?
Nope, but I've noticed it tends to hallucinate the same set of phrases, so I have the LLM remove them.
I completely agree that technology in the last couple years has genuinely been fulfilling the promise established in my childhood sci-fi.
The other day, alone in a city I'd never been to before, I snapped a photo of a bistro's daily specials hand-written on a blackboard in Chinese, copied the text right out of the photo, translated it into English, learned how to pronounce the menu item I wanted, and ordered some dinner.
Two years ago this story would have been: notice the special board, realize I don't quite understand all the characters well enough to choose or order, and turn wistfully to the menu to hopefully find something familiar instead. Or skip the bistro and grab a pre-packaged sandwich at a convenience store.
> I snapped a photo of a bistro's daily specials hand-written on a blackboard in Chinese, copied the text right out of the photo, translated it into English, learned how to pronounce the menu item I wanted, and ordered some dinner.
> Two years ago
This functionality was available in 2014, on either an iPhone or Android. I ordered specials in Taipei way before Covid. Here's the blog post celebrating it:
https://blog.google/products/translate/one-billion-installs/
This is all a post about AI, hype, and skepticism. In my childhood sci-fi, the idea of people working multiple jobs and still not being able to afford rent was written as shocking or seen as dystopian. All this incredible technology is a double-edged sword, but it doesn't solve the problems of the day, only the problems of business efficiency, which exacerbates the problems of the day.
It was available as early as 2012, probably earlier as IIRC Microsoft was copying:
https://www.pcworld.com/article/470008/bing_translator_app_g...
The part of that Google Translate announcement that covered translating handwritten Chinese must have gone missing.
> The other day, alone in a city I'd never been to before, I snapped a photo of a bistro's daily specials hand-written on a blackboard in Chinese, copied the text right out of the photo, translated it into English, learned how to pronounce the menu item I wanted, and ordered some dinner.
To be fair, dedicated apps like Pleco have supported things like this for 6+ years, but the spread of modern language models has made it more accessible.
Definitely not. I took this same basic idea of feeding videos into Whisper to get SRT subtitles and took it a step further to make automatic Anki flashcards for listening practice in foreign languages [1]. I literally feel like I'm living in the future every time one of those cards from whatever silly Finnish video I found on YouTube pops up in my queue.
These models have made it possible to robustly practice all 4 quadrants of language learning for most common languages using nothing but a computer, not just passive reading. Whisper is directly responsible for 2 of those quadrants, listening and speaking. LLMs are responsible for writing [2]. We absolutely live in the future.
[1]: https://github.com/hiandrewquinn/audio2anki
[2]: https://hiandrewquinn.github.io/til-site/posts/llm-tutored-w...
Hi Andrew, I've been trying to get a similar audio language support app hacked together in a podcast player format (I started with Anytime Player) using some of the same principles in your project (transcript generation, chunking, level & obscurity aware timestamped hints and translations).
I really think support for native content is the ideal way to learn for someone like me, especially with listening.
Thanks for posting and good luck.
Translation seems like the ideal application. It seems as though an LLM would truly have no issues integrating societal concepts, obscure references, pop culture, and more, and be able to compare it across culture to find a most-perfect translation. Even if it has to spit out three versions to perfectly communicate, it’s still leaps and bounds ahead of traditional translators already.
> it’s still leaps and bounds ahead of traditional translators already
Traditional machine translators, perhaps. Human translation is still miles ahead when you actually care about the quality of the output. But for getting a general overview of a foreign-language website, translating a menu in a restaurant, or communicating with a taxi driver? Sure, LLMs would be a great fit!
I should’ve been more clear that this is basically what I meant! The availability of the LLM is the real killer because yeah - most translation jobs are needed for like 15 minutes in a relatively low-stakes environment. Perfect for LLMs. That complex stuff will come later when verifiability is possible and fast.
Modern machine translators have been good enough for a few years now, to do business far more complicated than ordering food. I do business every day with people in foreign languages, using these tools. They are reliable.
> Human translation is still miles ahead when you actually care about the quality of the output.
The current SOTA LLMs are better than traditional machine translators (there is no perhaps) and most human translators.
If a 'general overview' is all you think they're good for, then you've clearly not seriously used them.
> It seems as though an LLM would truly have no issues integrating societal concepts, obscure references, pop culture, and more, and be able to compare it across culture to find a most-perfect translation.
Somehow LLMs can't do that for structured code with well-defined semantics, but sure, they will be able to extract "obscure references" from speech/text.
All these people who think this technology is already done evolving are so confusing. This has nothing to do with my statement even if it weren’t misleading to begin with.
There is really not that much similar between trying to code and trying to translate emotion. At the very least, language “compiles” as long as the words are in a sensible order and maintain meaning across the start and finish.
All they need to do now in order to be able to translate well is to have contextual knowledge to inform better responses on the translated end. They’ve been doing that for years, so I really don’t know what you’re getting at here.
I started watching Leverage, the TV show, on Amazon, and the subtitles in the early series are clearly AI generated or just bad by default.
I use subtitles because I don’t want to micromanage the volume on my TV when adverts are forced on me and they are 100x louder than what I was watching.
Old TV series should have closed captions available (which are apparently different from subtitles); however, the question of where to obtain them, aside from VHS copies, might be difficult.
And of course, a lot of modern "dvd players" do not properly transmit closed captions as subtitles over HDMI, so that sure isn't helping
A slightly off-topic but interesting video about this: https://www.youtube.com/watch?v=OSCOQ6vnLwU
Many DVDs of old movies and TV shows may contain the closed captions, but they are not visible through HDMI. You have to connect your DVD player to your TV via the composite video analogue outputs.
This video explains all about it: https://youtu.be/OSCOQ6vnLwU
Yes, they need to be "burned in" to the picture to work with HDMI (he shows a couple of Blu-ray players towards the end that do this; there are also some models mentioned in the comments).
Last time I used Whisper with a foreign language (Chinese) video, I’m pretty sure it just made some stuff up.
The captions looked like they would be correct in context, but I could not cross-reference them with snippets of manually checked audio, to the best of my ability.
I tried whisper with a movie from the 60's and it was a disaster.
Not sure if it was due to the poor quality of the sound, the fact that people spoke a bit differently 60 years ago, or that 3 different languages were used (the plot took place in France during WW2).
I feel you. In the late 00s/early 10s, downloading American movies was fairly easy, but getting the subtitles was a challenge. It was even worse with movies from other regions. Even now I know people who record conversations to be replayed through Whisper so they can get 100% of the info from them.
Disclaimer: I'm not praising piracy, but outside US borders it's a free-for-all.
Using AI to generate subtitles is inventive. Is it smart enough to insert the time codes such that the subtitle is well enough synchronised to the spoken line?
As someone who has started losing the higher frequencies and thus clarity, I have subtitles on all the time just so I don't miss dialogue. The only pain point is when the subtitles (of the same language) are not word-for-word with the spoken line. The discordance between what you are reading and hearing is really distracting.
This is my major peeve with my The West Wing DVDs, where the subtitles are often an abridgement of the spoken line.
> Is it smart enough to insert the time codes such that the subtitle is well enough synchronised to the spoken line?
Yes, Whisper has been able to do this since the first release. At work we use it to live-transcribe-and-translate all-hands meetings and it works very well.
I don't think you are also including having AI lie or "hallucinate" to us, which is an important point, even if the article is only about having AI write code for an organization.
What is the relevance of this comment? The post is about LLMs in programming. Not about translation or NLP, two things transformers do quite well and that hardly anyone contests.
It would be interesting to see whether court transcriptions could be handled by these models.