
Comment by londons_explore

10 days ago

Does this have the ability to edit historic words as more info becomes available?

E.g. if I say "I scream", it sounds phonetically identical to "Ice cream".

Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".

Doing this seems necessary to get both low latency and high accuracy. Things like transcription on Android do it, and you can see the guesses adjusting as you talk.
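
To make the "makes a lot less sense" point concrete: even a toy bigram language model, trained on a handful of made-up sentences, scores the "ice cream" reading higher once the rest of the sentence is available, and that is the kind of signal a decoder could use to go back and revise the earlier words. A purely illustrative sketch, nothing to do with Whisper's internals:

    # Toy bigram language model: score a candidate transcript by how often
    # its word pairs occur in a (tiny, invented) corpus.
    from collections import Counter

    corpus = ("ice cream is the best dessert . i scream when i am scared . "
              "ice cream is delicious . the best dessert is ice cream .").split()
    bigrams = Counter(zip(corpus, corpus[1:]))

    def score(sentence):
        """Sum of bigram counts; higher means the word sequence is more familiar."""
        words = sentence.lower().split()
        return sum(bigrams[(a, b)] for a, b in zip(words, words[1:]))

    print(score("I scream is the best dessert"))   # lower score
    print(score("Ice cream is the best dessert"))  # higher score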

A good opportunity to point people to the paper with my favorite title of all time:

"How to wreck a nice beach you sing calm incense"

https://dl.acm.org/doi/10.1145/1040830.1040898

This is what your brain does when it processes language.

I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.

  • A slight segue to this: I was made aware of the phenomenon that the language you think in sets the constraints on how expansively the brain can think and parse information.

    I think in English, fortunately, and it’s an ever-evolving language, expanding as the world does. That’s compared to the majority of people where I’m from: English was a second language they had to learn, and the people who taught them weren’t well equipped with the resources to do a good job.

    └── Dey well; Be well

    • This is called linguistic relativity (née the Sapir-Whorf hypothesis), and the strong form you describe has fallen out of favour in modern linguistics.

      A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.

      Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to put what they are thinking into words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not actively hear any language in my head unless I actively try to (at least, since I was a teenager).

      (This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)


It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns, and narratively important mishearings. It’s like you need to subtitle what is heard, not what is said.

Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?

It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomenon from their abstractions, with their claims of a beauty like music... hmm!

  • I had similar thoughts when reading Huck Finn. It's not just phonetically spelled, it's much different. Almost like Twain came up with a list of words, and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point, you just get good at bad spelling?

    • Writing in the vernacular, I believe it's called. I do something like that if I'm texting.

      The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.


Whisper works on 30-second chunks. So yes, it can do that, and that’s also why it can hallucinate quite a bit.

  • The ffmpeg code seems to default to three-second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

        queue
        
             The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"

    • so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

      I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

      The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
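
      A toy version of that commit rule (purely illustrative, not any particular service’s implementation): keep the N hypotheses around and only finalise the word prefix they all agree on, so "I scream" stays revisable until the rest of the sentence prunes that reading away.

          def stable_prefix(hypotheses):
              """Longest word-level prefix shared by every hypothesis in the beam."""
              prefix = []
              for words in zip(*hypotheses):
                  if all(w == words[0] for w in words):
                      prefix.append(words[0])
                  else:
                      break
              return prefix

          # Early on the beams disagree, so nothing gets committed yet:
          print(stable_prefix([["i", "scream"], ["ice", "cream"]]))   # []

          # After "...is the best dessert" arrives, the unlikely beam is pruned,
          # the survivors agree, and the prefix can be fixed and displayed:
          print(stable_prefix([["ice", "cream", "is", "the", "best", "dessert"],
                               ["ice", "cream", "is", "the", "best", "desert"]]))
          # ['ice', 'cream', 'is', 'the', 'best']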


  • Whisper is excellent, but not perfect.

    I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."

    • Whisper supports adding a context, and if you’re transcribing a phone call you should probably add "Transcribe this phone call with Gem", in which case it would probably get the name right.
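
      For reference, with the open-source openai-whisper Python package that kind of context goes in through the initial_prompt argument; a rough sketch (the audio file name is made up):

          import whisper  # pip install openai-whisper

          model = whisper.load_model("base")
          result = model.transcribe(
              "phone_call.wav",  # hypothetical recording of the call
              # The prompt conditions the decoder, so the name "Gem" becomes a
              # likely token; Whisper doesn't "follow" the instruction as such.
              initial_prompt="Transcribe this phone call with Gem",
          )
          print(result["text"])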


    • That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.


I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf

I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
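
To give a rough picture of that "language model guiding its decoding" idea (the numbers below are invented for illustration; this is not Whisper's actual decoder): acoustically the two readings are nearly tied, so the language-model term is what decides which hypothesis the beam search keeps.

    import math

    # Made-up log-probabilities: the acoustics can barely tell the readings
    # apart, while a language model strongly prefers the grammatical one.
    acoustic = {"i like i scream": math.log(0.48), "i like ice cream": math.log(0.52)}
    lm       = {"i like i scream": math.log(0.02), "i like ice cream": math.log(0.60)}

    def fused_score(hyp, lm_weight=0.8):
        """Shallow fusion: acoustic log-prob plus a weighted LM log-prob."""
        return acoustic[hyp] + lm_weight * lm[hyp]

    print(max(acoustic, key=fused_score))  # "i like ice cream" wins once the LM weighs in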

The "I" is emphasized more in "I scream" than in "ice cream", I think.

But it’s a great point that you need context to be sure.