Comment by phkahler
6 months ago
I thought whisper and others took large chunks (20-30 seconds) of speech, or a complete wave file, as input. How do you get real-time transcription? What size chunks do you feed it?
To me, STT should take a continuous audio stream and output a continuous text stream.
I use VAD to chunk audio.
Whisper and Moonshine both work on chunks, but for Moonshine:
> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
Also, with kyutai, we can feed continuous audio in and get continuous text out.
- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...
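For anyone wondering what "VAD to chunk audio" looks like in practice, here's a minimal Python sketch using webrtcvad (my choice of library, not necessarily what's used above). It buffers 30 ms frames while speech is detected and yields one byte buffer per utterance; feeding each buffer to Whisper/Moonshine/whatever is left to the caller.

```python
# Minimal VAD-based chunker sketch. Assumes 16 kHz, 16-bit mono PCM frames,
# which is what webrtcvad expects (it accepts 10/20/30 ms frames).
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 960 bytes per 30 ms frame

def utterances(frames, aggressiveness=2, trailing_silence_frames=10):
    """Yield one raw PCM buffer per detected utterance."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 (least) .. 3 (most aggressive)
    buf, silence = [], 0
    for frame in frames:  # each frame: exactly FRAME_BYTES of raw PCM
        if vad.is_speech(frame, SAMPLE_RATE):
            buf.append(frame)
            silence = 0
        elif buf:
            buf.append(frame)
            silence += 1
            # ~300 ms of trailing silence closes the chunk
            if silence >= trailing_silence_frames:
                yield b"".join(buf)
                buf, silence = [], 0
    if buf:
        yield b"".join(buf)
```

Chunking on silence like this is what keeps sentences from getting cut mid-word, which is the failure mode with fixed-size windows.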
Having used Whisper and seen the poor quality caused by its 30-second chunks, I would stay far away from software working on even shorter durations.
Short chunks effectively mean the transcription starts producing nonsense as soon as a sentence is cut in the middle.
Something like that, in a CLI tool that just writes text to stdout, would be perfect for a lot of my use cases!
(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)
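Something like this rough sketch of the "audio in, text out" filter idea, for instance. It reads raw 16 kHz 16-bit mono PCM from stdin in fixed windows and prints each window's transcript to stdout; faster-whisper is my stand-in backend here, and a real tool would chunk on silence rather than fixed windows.

```python
# Sketch: raw PCM on stdin -> transcript lines on stdout.
import sys
import numpy as np
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
WINDOW_SECONDS = 5
WINDOW_BYTES = SAMPLE_RATE * WINDOW_SECONDS * 2  # 16-bit samples

model = WhisperModel("base.en", compute_type="int8")

while True:
    raw = sys.stdin.buffer.read(WINDOW_BYTES)
    if not raw:
        break
    raw = raw[: len(raw) - (len(raw) % 2)]  # drop a stray trailing byte at EOF
    # Convert int16 PCM to the float32 [-1, 1] array faster-whisper accepts
    audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    segments, _ = model.transcribe(audio)
    for seg in segments:
        print(seg.text.strip(), flush=True)
```

You'd drive it with something like `ffmpeg -i talk.mp3 -f s16le -ar 16000 -ac 1 - | python filter.py` (filter.py being this hypothetical script).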
I wrote a tool that may be just the thing for you:
https://github.com/bikemazzell/skald-go/
Just speech to text, CLI only, and it can paste into whatever app you have open.
Are you thinking about the realtime use-case or batch use-case?
For just transcribing file/audio,
`owhisper run <MODEL> --file a.wav` or
`curl https://something.com/audio.wav | owhisper run <MODEL>`
might make sense.