Comment by yujonglee

6 months ago

Happy to answer any questions!

This is the list of local models it supports:

- whisper-cpp-base-q8

- whisper-cpp-base-q8-en

- whisper-cpp-tiny-q8

- whisper-cpp-tiny-q8-en

- whisper-cpp-small-q8

- whisper-cpp-small-q8-en

- whisper-cpp-large-turbo-q8

- moonshine-onnx-tiny

- moonshine-onnx-tiny-q4

- moonshine-onnx-tiny-q8

- moonshine-onnx-base

- moonshine-onnx-base-q4

- moonshine-onnx-base-q8

I thought whisper and others took large chunks (20-30 seconds) of speech, or a complete wave file as input. How do you get real-time transcription? What size chunks do you feed it?

To me, STT should take a continuous audio stream and output a continuous text stream.

  • I use VAD to chunk audio.

    Whisper and Moonshine both work on chunks, but for Moonshine:

    > Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.

    Also, with Kyutai, we can feed continuous audio in and get continuous text out.

    - https://github.com/moonshine-ai/moonshine
    - https://docs.hyprnote.com/owhisper/configuration/providers/k...
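For anyone wondering what "use VAD to chunk audio" can look like in practice, here is a minimal sketch. It assumes a plain energy threshold rather than whatever VAD owhisper actually uses, and `vad_chunks`, `threshold`, and `max_silence_frames` are illustrative names, not part of any library. The idea is only to show that chunks get cut at pauses instead of at fixed time boundaries.

```python
import math
from typing import Iterator, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def vad_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold: float = 0.02,       # assumed energy cutoff for "speech"
    max_silence_frames: int = 10,  # assumed pause length that ends a chunk
) -> Iterator[List[float]]:
    """Yield speech chunks, cutting only after a sustained pause.

    This is a toy energy-based detector; real VADs are model-based and
    far more robust to noise.
    """
    frame_len = sample_rate * frame_ms // 1000
    chunk: List[float] = []
    silence = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) >= threshold:
            chunk.extend(frame)
            silence = 0
        elif chunk:
            # Keep short pauses inside the chunk so words are not clipped.
            chunk.extend(frame)
            silence += 1
            if silence >= max_silence_frames:
                yield chunk
                chunk = []
                silence = 0
        # Silence before any speech is simply dropped.
    if chunk:
        yield chunk
```

Each yielded chunk would then be handed to the model on its own. Because the cut happens after a pause, a sentence is less likely to be split in the middle than with fixed-length windows.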

    • Having used Whisper and seen how poor the quality gets because of its 30-second chunks, I would stay far away from software that works on even shorter durations.

      The short duration effectively means that the transcription will start producing nonsense as soon as a sentence is cut up in the middle.

    • Something like that, in a cli tool, that just gives text to stdout would be perfect for a lot of use cases for me!

      (maybe with an `owhisper serve` somewhere else to start the model running or whatever.)

FYI: `owhisper pull whisper-cpp-large-turbo-q8` fails with "Failed to download model.ggml: Other error: Server does not support range requests. Got status: 200 OK"

But the base-q8 works (and works quite well!). The TUI is really nice. Speaker diarization would make it almost perfect for me. Thanks for building this.

Sorry, maybe I missed it, but I didn't see this list on your website. I think it would be a good idea to add this info there. Besides that, thank you for the effort and your work! I will definitely give it a try.