
Comment by yujonglee

6 months ago

I use VAD to chunk audio.
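For context, VAD-style chunking splits the audio at stretches of silence so each chunk holds a whole utterance. A minimal sketch of that idea, using a simple energy threshold as a stand-in for a real VAD (e.g. webrtcvad); frame size and thresholds here are illustrative:

```python
import math

def chunk_by_silence(samples, frame_len=480, threshold=0.01, min_silence_frames=10):
    """Return (start, end) sample ranges for speech chunks.

    A frame counts as "speech" when its mean energy exceeds `threshold`;
    a chunk ends after `min_silence_frames` consecutive silent frames.
    """
    chunks = []
    start = None
    silent_run = 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        if energy >= threshold:          # "speech" frame
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:          # "silence" frame inside a chunk
            silent_run += 1
            if silent_run >= min_silence_frames:
                chunks.append((start, i - (silent_run - 1) * frame_len))
                start, silent_run = None, 0
    if start is not None:
        chunks.append((start, len(samples)))
    return chunks

# Synthetic example: 1 s of tone, 1 s of silence, 1 s of tone at 16 kHz.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
audio = tone + [0.0] * sr + tone
print(chunk_by_silence(audio))  # two chunks, roughly the two tone bursts
```

Each chunk can then be handed to the model as one unit, so sentences are far less likely to be cut mid-word than with fixed 30-second windows.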

Whisper and Moonshine both work on one chunk at a time, but for Moonshine:

> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.

Also, with kyutai we can stream continuous audio in and get continuous text out.
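The continuous in/out pattern amounts to feeding fixed-size audio frames and emitting text deltas as they become available. A hypothetical interface sketch of that shape (not kyutai's actual API; the class and its buffering rule are invented for illustration):

```python
# Hypothetical streaming-transcription loop: audio frames in, text deltas out.
# StreamingTranscriber is illustrative only; a real model would run
# incremental decoding where this stub just counts buffered frames.
class StreamingTranscriber:
    def __init__(self):
        self._buffered = []

    def feed(self, frame):
        """Accept one audio frame; return any newly finalized text."""
        self._buffered.append(frame)
        if len(self._buffered) >= 3:     # pretend 3 frames decode to a token
            self._buffered.clear()
            return "<token> "
        return ""

transcriber = StreamingTranscriber()
pieces = []
for frame in range(9):                   # stand-in for a live audio stream
    pieces.append(transcriber.feed(frame))
print("".join(pieces))                   # text arrives incrementally
```

The point is that text is produced as the audio arrives, rather than after each fixed 30-second window completes.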

- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...

Having used Whisper and seen the poor quality caused by its 30-second chunking, I would stay far away from software that works on even shorter durations.

The short duration effectively means the transcription starts producing nonsense as soon as a sentence is cut off in the middle.

Something like that, as a CLI tool that just writes text to stdout, would be perfect for a lot of my use cases!

(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)