Comment by phkahler
6 months ago
I thought whisper and others took large chunks (20-30 seconds) of speech, or a complete wave file, as input. How do you get real-time transcription? What size chunks do you feed it?
To me, STT should take a continuous audio stream and output a continuous text stream.
I use VAD to chunk audio.
Whisper and Moonshine both work on chunks, but for Moonshine:
> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
Also, with kyutai, we can feed continuous audio in and get continuous text out.
- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...
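For anyone wondering what "VAD to chunk audio" looks like in practice, here's a minimal Python sketch using webrtcvad (my choice of library, not necessarily what's used above). It buffers 30 ms frames while speech is detected and yields one byte buffer per utterance; feeding each buffer to Whisper/Moonshine/whatever is left to the caller.

```python
# Minimal VAD-based chunker sketch. Assumes 16 kHz, 16-bit mono PCM frames,
# which is what webrtcvad expects (it accepts 10/20/30 ms frames).
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 960 bytes per 30 ms frame

def utterances(frames, aggressiveness=2, trailing_silence_frames=10):
    """Yield one raw PCM buffer per detected utterance."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 (least) .. 3 (most aggressive)
    buf, silence = [], 0
    for frame in frames:  # each frame: exactly FRAME_BYTES of raw PCM
        if vad.is_speech(frame, SAMPLE_RATE):
            buf.append(frame)
            silence = 0
        elif buf:
            buf.append(frame)
            silence += 1
            # ~300 ms of trailing silence closes the chunk
            if silence >= trailing_silence_frames:
                yield b"".join(buf)
                buf, silence = [], 0
    if buf:
        yield b"".join(buf)
```

Chunking on silence like this is what keeps sentences from getting cut mid-word, which is the failure mode with fixed-size windows.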
Having used Whisper and seen the poor quality caused by its 30-second chunks, I would stay far away from software working on even shorter durations.
Short chunks effectively mean the transcription starts producing nonsense as soon as a sentence is cut in the middle.
Something like that, in a CLI tool that just writes text to stdout, would be perfect for a lot of my use cases!
(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)
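Something like this rough sketch of the "audio in, text out" filter idea, for instance. It reads raw 16 kHz 16-bit mono PCM from stdin in fixed windows and prints each window's transcript to stdout; faster-whisper is my stand-in backend here, and a real tool would chunk on silence rather than fixed windows.

```python
# Sketch: raw PCM on stdin -> transcript lines on stdout.
import sys
import numpy as np
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
WINDOW_SECONDS = 5
WINDOW_BYTES = SAMPLE_RATE * WINDOW_SECONDS * 2  # 16-bit samples

model = WhisperModel("base.en", compute_type="int8")

while True:
    raw = sys.stdin.buffer.read(WINDOW_BYTES)
    if not raw:
        break
    raw = raw[: len(raw) - (len(raw) % 2)]  # drop a stray trailing byte at EOF
    # Convert int16 PCM to the float32 [-1, 1] array faster-whisper accepts
    audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    segments, _ = model.transcribe(audio)
    for seg in segments:
        print(seg.text.strip(), flush=True)
```

You'd drive it with something like `ffmpeg -i talk.mp3 -f s16le -ar 16000 -ac 1 - | python filter.py` (filter.py being this hypothetical script).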
I wrote a tool that may be just the thing for you:
https://github.com/bikemazzell/skald-go/
Just speech to text, CLI only, and it can paste into whatever app you have open.
Are you thinking about the realtime use-case or batch use-case?
For just transcribing file/audio,
`owhisper run <MODEL> --file a.wav` or
`curl https://something.com/audio.wav | owhisper run <MODEL>`
might make sense.