Comment by londons_explore
10 days ago
Does this have the ability to edit historic words as more info becomes available?
E.g. if I say "I scream", it sounds phonetically identical to "ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to get both low latency and high accuracy; things like transcription on Android do this, and you can see the guesses adjusting as you talk.
A good opportunity to point people to the paper with my favorite title of all time:
"How to wreck a nice beach you sing calm incense"
https://dl.acm.org/doi/10.1145/1040830.1040898
For folks like me puzzling over what the correct transcription of the title should be, I think it's "How to recognize speech using common sense"
Thank you! "Calm incense" makes very little sense when said in an accent where calm isn't pronounced like com.
This is the correct parsing of it. (I can't take credit for coming up with the title, but I worked on the project.)
I only got the "How to recognize" part. Also I think "using" should sound more like "you zinc" than "you sing".
Thanks. Now I know that I'm not that stupid and this actually makes no sense.
Thank you very much!
The paper: https://sci-hub.st/https://dl.acm.org/doi/10.1145/1040830.10...
(Agree that the title is awesome, by the way!)
Direct PDF download link:
https://web.media.mit.edu/~lieber/Publications/Wreck-a-Nice-...
Fun fact: I could not work out what this was supposed to be, so I used Whisper (indirectly, via the FUTO Voice Input app on my phone) and repeated the sentence into it, and it came out with the 'correct' transcription of "How to recognize speech using common sense." on the first try.
Of course, this is nothing like what I actually said, so... make your own mind up whether that is actually a correct transcription or not!
I have a British accent, for the record.
My favorite is:
"Threesomes, with and without blame"
https://dl.acm.org/doi/10.1145/1570506.1570511
(From a professor I worked with a bit in grad school)
Also relevant: The Two Ronnies - "Four Candles"
https://www.youtube.com/watch?v=gi_6SaqVQSw
Does AI voice recognition still use Markov models for this?
Whisper uses an encoder-decoder transformer.
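For reference, a minimal sketch with the open-source `openai-whisper` package (the model name "base" is real; the point is just to show the two halves of the architecture):

    # pip install openai-whisper
    import whisper

    model = whisper.load_model("base")
    print(type(model.encoder).__name__)  # AudioEncoder: turns audio into features
    print(type(model.decoder).__name__)  # TextDecoder: emits tokens conditioned on them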
This is what your brain does when it processes language.
I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.
A slight segue: I was made aware of the phenomenon that the language you think in sets constraints on how expansively the brain can think and parse information.
Fortunately I think in English, an ever-evolving language that expands as the world does. Compare that to the majority of people where I'm from: English was a second language they had to learn, and the people who taught them weren't well equipped with the resources to do a good job.
Dey well; Be well
This is called linguistic relativity (née the Sapir-Whorf hypothesis), and the strong form you describe has fallen out of favour in modern linguistics.
A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.
Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to put what they are thinking into words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not actively hear any language in my head unless I actively try to (at least, not since I was a teenager).
(This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)
It makes me curious how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns, and narratively important mishearings. It's like you need to subtitle what is heard, not what is said.
Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?
It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomenon from their abstractions, given their claims of a beauty like music... hmm!
The quality of subtitles implies that almost no effort is being put into their creation. Watch even a high budget movie/TV show and be aghast at how frequently they diverge.
A good subtitle isn't a perfect copy of what was said.
I had similar thoughts when reading Huck Finn. It's not just phonetically spelled; it's much further off than that. Almost like Twain came up with a list of words and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point you just get good at bad spelling?
Writing in the vernacular, I believe it's called. I do something like that if I'm texting.
The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.
Whisper works on 30-second chunks. So yes, it can do that, and that's also why it can hallucinate quite a bit.
The ffmpeg code seems to default to three-second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1), so if "I scream" is in one chunk and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!
I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.
The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
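A hypothetical sketch of that commit rule (the names and data here are mine, not from any particular service):

    # Keep N candidate transcripts; commit a word once all N agree on it.
    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        words: list[str]   # words decoded so far in this candidate
        logprob: float     # running score of this candidate

    def committed_prefix(hyps: list[Hypothesis]) -> list[str]:
        """Return the leading words that are identical across all hypotheses."""
        prefix = []
        for position, word in enumerate(hyps[0].words):
            if all(len(h.words) > position and h.words[position] == word
                   for h in hyps):
                prefix.append(word)
            else:
                break
        return prefix

    # Three candidates diverge after the second word, so only "how to" is
    # final; everything after it can still be revised as audio arrives.
    hyps = [
        Hypothesis(["how", "to", "recognize", "speech"], -1.2),
        Hypothesis(["how", "to", "wreck", "a", "nice"], -1.5),
        Hypothesis(["how", "to", "recognize", "beach"], -2.0),
    ]
    print(committed_prefix(hyps))  # ['how', 'to']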
Whisper is excellent, but not perfect.
I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."
Whisper supports adding context, and if you're transcribing a phone call, you should probably add something like "Transcribe this phone call with Gem", in which case it would probably get the name right.
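With the open-source `openai-whisper` package, that context goes in as `initial_prompt` (paths and wording here are placeholders):

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe(
        "phone_call.wav",                         # placeholder path
        initial_prompt="A phone call with Gem.",  # biases decoding toward "Gem"
    )
    print(result["text"])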
That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.
So, yes, and also no.
I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf
I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
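As a toy illustration of that last point (the probabilities are made up; real systems learn them from data):

    import math

    # Hypothetical bigram log-probabilities favoring common word pairs.
    BIGRAM_LOGPROB = {
        ("like", "ice"): math.log(0.05),
        ("ice", "cream"): math.log(0.4),
        ("like", "i"): math.log(0.01),
        ("i", "scream"): math.log(0.001),
    }

    def lm_score(words):
        # Sum of bigram log-probs; unseen pairs get a small floor probability.
        return sum(BIGRAM_LOGPROB.get(pair, math.log(1e-6))
                   for pair in zip(words, words[1:]))

    # Both candidates sound identical; the language model breaks the tie.
    print(lm_score("i like ice cream".split()))  # higher (less negative)
    print(lm_score("i like i scream".split()))   # lower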
I Scream in the Sun https://carmageddon.fandom.com/wiki/I_Scream_in_the_Sun
The "I" is emphasized more in "I scream" than in "ice cream", I think.
But it's a great point that you need context to be sure.
What would it make of this? https://www.youtube.com/watch?v=zyvZUxnIC3k