Comment by rowanG077

2 days ago

Is there anything truly low latency (sub-100 ms)? Speech recognition is so cool, but I want it to be low latency.

Parakeet does streaming, I think, so if you throw enough compute at it, it should be. The closest competitor is Whisper v3, which is relatively slow; maybe Voxtral, but it's still very new.

  • The Python MLX version of Parakeet does indeed support streaming: https://github.com/senstella/parakeet-mlx It requires modifying the inference algorithm. In that implementation, I see the author even uses a custom Metal kernel to get maximum performance. Parakeet's batch inference logic is simple, but streaming may take some effort to get the best performance. It's not only a dependency issue.
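    The core idea behind chunked streaming can be sketched like this. This is a minimal illustration of the windowing logic only (real streaming implementations like the one linked above also carry decoder state across chunks); `stream_chunks` and its parameters are hypothetical names, not from parakeet-mlx:

    ```python
    def stream_chunks(samples, chunk_size, context_size):
        """Yield (left context + chunk) windows over an audio stream.

        Each window carries up to `context_size` samples of left context
        so the model sees continuous audio across chunk boundaries.
        """
        for start in range(0, len(samples), chunk_size):
            ctx_start = max(0, start - context_size)
            yield samples[ctx_start:start + chunk_size]

    # Toy example: 10 "samples", chunks of 4, 2 samples of left context.
    windows = list(stream_chunks(list(range(10)), chunk_size=4, context_size=2))
    # windows[0] == [0, 1, 2, 3]          (no left context yet)
    # windows[1] == [2, 3, 4, 5, 6, 7]    (2 samples of overlap)
    ```

    The overlap is what makes the latency/compute trade-off: more context generally helps accuracy at chunk boundaries, but every extra sample of context is extra compute per emitted chunk.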

  • There's a minimum possible latency just given the structure of language and how humans process phonemes. Spoken language isn't quite unambiguously causal so there's a limit to how far you can go for a given accuracy. I don't know where the efficiency curve is though. It wouldn't surprise me if 100ms was pushing it.

    • Yeah, the metric would be the total processing latency after that. I've found that VAD is honestly harder to get right than STT, and if that fails, the STT only gets garbage to process. Even humans sometimes have trouble figuring out when exactly someone is done talking.
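      The end-of-speech problem usually comes down to a "hangover" rule: only declare the utterance finished after several consecutive silent frames, so brief pauses mid-sentence don't cut the speaker off. A toy energy-based sketch (all names and thresholds here are illustrative, not from any particular VAD library):

      ```python
      def end_of_speech_frame(frame_energies, threshold, hangover):
          """Return the frame index where end-of-speech is declared,
          or None if the speaker never pauses long enough.

          Speech ends only after `hangover` consecutive frames whose
          energy falls below `threshold`.
          """
          silent_run = 0
          speaking = False
          for i, energy in enumerate(frame_energies):
              if energy >= threshold:
                  speaking = True
                  silent_run = 0      # any speech resets the silence counter
              elif speaking:
                  silent_run += 1
                  if silent_run >= hangover:
                      return i
          return None

      # Speech, a short mid-sentence dip, more speech, then a real pause.
      energies = [0.9, 0.8, 0.1, 0.7, 0.8, 0.1, 0.1, 0.1]
      # With hangover=3, the single silent frame at index 2 is ignored;
      # end-of-speech fires at index 7, the third consecutive silent frame.
      ```

      This is exactly where the latency tension lives: a longer hangover means fewer false cut-offs but adds its full duration to the end-of-utterance latency, which is why sub-100 ms total is so hard even with a fast STT model.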