
Comment by londons_explore

10 days ago

Does this have the ability to edit historic words as more info becomes available?

E.g. if I say "I scream", it sounds phonetically identical to "Ice cream".

Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".

Doing this seems necessary to get both low latency and high accuracy. Things like transcription on Android do it, and you can see the guesses adjusting as you talk.
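
To make the "makes a lot less sense" point concrete: even a toy bigram language model, trained on a handful of made-up sentences, scores the "ice cream" reading higher once the rest of the sentence is available, and that is the kind of signal a decoder could use to go back and revise the earlier words. A purely illustrative sketch, nothing to do with Whisper's internals:

    # Toy bigram language model: score a candidate transcript by how often
    # its word pairs occur in a (tiny, invented) corpus.
    from collections import Counter

    corpus = ("ice cream is the best dessert . i scream when i am scared . "
              "ice cream is delicious . the best dessert is ice cream .").split()
    bigrams = Counter(zip(corpus, corpus[1:]))

    def score(sentence):
        """Sum of bigram counts; higher means the word sequence is more familiar."""
        words = sentence.lower().split()
        return sum(bigrams[(a, b)] for a, b in zip(words, words[1:]))

    print(score("I scream is the best dessert"))   # lower score
    print(score("Ice cream is the best dessert"))  # higher score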

A good opportunity to point people to the paper with my favorite title of all time:

"How to wreck a nice beach you sing calm incense"

https://dl.acm.org/doi/10.1145/1040830.1040898

This is what your brain does when it processes language.

I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.

  • A slight segue to this: I was made aware of the phenomenon that the language you think in sets the constraints on how expansively the brain can think and parse information.

    I think in English, fortunately, and it’s an ever-evolving language, expanding as the world does. That’s compared to the majority of people where I’m from: English was a second language they had to learn, and the people who taught them weren’t well equipped with the resources to do a good job.

    └── Dey well; Be well

    • This is called linguistic relativity (née the Sapir-Whorf hypothesis), and the strong form you describe has fallen out of favour in modern linguistics.

      A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.

      Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to put what they are thinking into words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not actively hear any language in my head unless I actively try to (at least, since I was a teenager).

      (This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)


It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns, and narratively important mishearings. It’s like you need to subtitle what is heard, not what is said.

Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?

It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomenon from their abstractions, with their claims of a beauty like music... hmm!

  • I had similar thoughts when reading Huck Finn. It's not just phonetically spelled, it's much different. Almost like Twain came up with a list of words, and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point, you just get good at bad spelling?

    • Writing in the vernacular, I believe it's called. I do something like that if I'm texting.

      The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.


Whisper works on 30-second chunks. So yes, it can do that, and that’s also why it can hallucinate quite a bit.

  • The ffmpeg code seems to default to three-second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

        queue
        
             The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"

    • so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

      I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

      The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
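
      A toy version of that commit rule (purely illustrative, not any particular service’s implementation): keep the N hypotheses around and only finalise the word prefix they all agree on, so "I scream" stays revisable until the rest of the sentence prunes that reading away.

          def stable_prefix(hypotheses):
              """Longest word-level prefix shared by every hypothesis in the beam."""
              prefix = []
              for words in zip(*hypotheses):
                  if all(w == words[0] for w in words):
                      prefix.append(words[0])
                  else:
                      break
              return prefix

          # Early on the beams disagree, so nothing gets committed yet:
          print(stable_prefix([["i", "scream"], ["ice", "cream"]]))   # []

          # After "...is the best dessert" arrives, the unlikely beam is pruned,
          # the survivors agree, and the prefix can be fixed and displayed:
          print(stable_prefix([["ice", "cream", "is", "the", "best", "dessert"],
                               ["ice", "cream", "is", "the", "best", "desert"]]))
          # ['ice', 'cream', 'is', 'the', 'best']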


  • Whisper is excellent, but not perfect.

    I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."

    • Whisper supports adding a context, and if you’re transcribing a phone call you should probably add "Transcribe this phone call with Gem", in which case it would probably get the name right.
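
      For reference, with the open-source openai-whisper Python package that kind of context goes in through the initial_prompt argument; a rough sketch (the audio file name is made up):

          import whisper  # pip install openai-whisper

          model = whisper.load_model("base")
          result = model.transcribe(
              "phone_call.wav",  # hypothetical recording of the call
              # The prompt conditions the decoder, so the name "Gem" becomes a
              # likely token; Whisper doesn't "follow" the instruction as such.
              initial_prompt="Transcribe this phone call with Gem",
          )
          print(result["text"])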


    • That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.


I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf

I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
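
To give a rough picture of that "language model guiding its decoding" idea (the numbers below are invented for illustration; this is not Whisper's actual decoder): acoustically the two readings are nearly tied, so the language-model term is what decides which hypothesis the beam search keeps.

    import math

    # Made-up log-probabilities: the acoustics can barely tell the readings
    # apart, while a language model strongly prefers the grammatical one.
    acoustic = {"i like i scream": math.log(0.48), "i like ice cream": math.log(0.52)}
    lm       = {"i like i scream": math.log(0.02), "i like ice cream": math.log(0.60)}

    def fused_score(hyp, lm_weight=0.8):
        """Shallow fusion: acoustic log-prob plus a weighted LM log-prob."""
        return acoustic[hyp] + lm_weight * lm[hyp]

    print(max(acoustic, key=fused_score))  # "i like ice cream" wins once the LM weighs in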

The "I" is emphasized more in "I scream" than in "ice cream", I think.

But it’s a great point that you need context to be sure.