Comment by rowanG077

2 days ago

Is there anything truly low latency (sub-100 ms)? Speech recognition is so cool, but I want it to be low latency.

Parakeet does streaming, I think, so if you throw enough compute at it, it should be. The closest competitor is Whisper v3, which is relatively slow; maybe Voxtral, but it's still very new.

  • The Python MLX version of Parakeet does indeed support streaming: https://github.com/senstella/parakeet-mlx It requires modifying the inference algorithm. In that implementation, I see the author even uses a custom Metal kernel to get maximum performance. Parakeet's batch inference logic is simple, but streaming may take some effort to get the best performance. It's not only a dependency issue.
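    The core idea behind chunked streaming can be sketched like this. This is a minimal illustration of the windowing logic only (real streaming implementations like the one linked above also carry decoder state across chunks); `stream_chunks` and its parameters are hypothetical names, not from parakeet-mlx:

    ```python
    def stream_chunks(samples, chunk_size, context_size):
        """Yield (left context + chunk) windows over an audio stream.

        Each window carries up to `context_size` samples of left context
        so the model sees continuous audio across chunk boundaries.
        """
        for start in range(0, len(samples), chunk_size):
            ctx_start = max(0, start - context_size)
            yield samples[ctx_start:start + chunk_size]

    # Toy example: 10 "samples", chunks of 4, 2 samples of left context.
    windows = list(stream_chunks(list(range(10)), chunk_size=4, context_size=2))
    # windows[0] == [0, 1, 2, 3]          (no left context yet)
    # windows[1] == [2, 3, 4, 5, 6, 7]    (2 samples of overlap)
    ```

    The overlap is what makes the latency/compute trade-off: more context generally helps accuracy at chunk boundaries, but every extra sample of context is extra compute per emitted chunk.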

  • There's a minimum possible latency just given the structure of language and how humans process phonemes. Spoken language isn't quite unambiguously causal so there's a limit to how far you can go for a given accuracy. I don't know where the efficiency curve is though. It wouldn't surprise me if 100ms was pushing it.

    • Yeah, the metric would be the total processing latency after that. I've found that VAD is honestly harder to get right than STT, and if that fails, the STT only gets garbage to process. Even humans sometimes have trouble figuring out when exactly someone is done talking.
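      The end-of-speech problem usually comes down to a "hangover" rule: only declare the utterance finished after several consecutive silent frames, so brief pauses mid-sentence don't cut the speaker off. A toy energy-based sketch (all names and thresholds here are illustrative, not from any particular VAD library):

      ```python
      def end_of_speech_frame(frame_energies, threshold, hangover):
          """Return the frame index where end-of-speech is declared,
          or None if the speaker never pauses long enough.

          Speech ends only after `hangover` consecutive frames whose
          energy falls below `threshold`.
          """
          silent_run = 0
          speaking = False
          for i, energy in enumerate(frame_energies):
              if energy >= threshold:
                  speaking = True
                  silent_run = 0      # any speech resets the silence counter
              elif speaking:
                  silent_run += 1
                  if silent_run >= hangover:
                      return i
          return None

      # Speech, a short mid-sentence dip, more speech, then a real pause.
      energies = [0.9, 0.8, 0.1, 0.7, 0.8, 0.1, 0.1, 0.1]
      # With hangover=3, the single silent frame at index 2 is ignored;
      # end-of-speech fires at index 7, the third consecutive silent frame.
      ```

      This is exactly where the latency tension lives: a longer hangover means fewer false cut-offs but adds its full duration to the end-of-utterance latency, which is why sub-100 ms total is so hard even with a fast STT model.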