
Comment by alfanick

3 months ago

I'm not looking for STT->AI->TTS, I'm looking for a truly good voice-to-text experience on Linux (and others). Siri/iOS-Dictation is truly good when it comes to understanding speech. Something at this level on Linux (and others) would be great. Yeah, always listening, maybe sending the data somewhere, but give me UX: hidden latency, optimizing for first chars recognized, a good (virtual) input device.

> Siri/iOS-Dictation is truly good when it comes to understanding the speech.

What...? It is terrible, even compared to Whisper Tiny, which was released years ago under an Apache 2.0 license so Apple could have adopted it instantly and integrated it into their devices. The bigger Whisper models are far better, and Parakeet TDT V2 (English) / V3 (Multilingual) are quite impressive and very fast.

I have no idea what would make someone say that iOS dictation is good at understanding speech... it is so bad.

For a company that talks so much about accessibility, it is baffling to me that Apple continues to ship such poor quality speech to text with their devices.

  • Its quality isn’t great, but it is damn fast and that matters a lot! Whisper doesn’t even work live without hacks.

    • Parakeet is insanely fast and much more accurate, and it doesn't really matter that Whisper requires hacks to work live when those hacks have existed for years and work great. (The Hello Transcribe app on iOS is a great example of how well Whisper can work with live streaming on an iPhone. The smaller models are extremely fast, even with the "hacks".)

      Parakeet TDT's architecture is actually a really cool way to boost both the speed and efficiency of real time STT compared to traditional approaches.

  • Terrible? It's fine. What's your accent that it's terrible? It even pulls last names from my address book and spells them right.

    • Terrible relative to everything else that exists today. I have a neutral American accent.

      Maybe you just don’t know what you’re missing? Google’s default speech to text is still bad compared to Whisper and Parakeet, but even Google’s is markedly better than Apple’s.

      I cannot think of a single speech to text system that I’ve run into in the past 5 years that is less accurate than the one Apple ships.

      Sure, Apple’s speech to text is incredible compared to what was on the flip phone I had 20 years ago. Terrible is relative. Much better options exist today, and they’re under very permissive licenses. Apple’s refusal to offer a better, more accessible experience to their users is frustrating when they wouldn’t even have to pay a licensing fee to ship something better. Whisper was released under a permissive license nearly 4 years ago.

      Apple also restricts third party keyboards to an absurdly tiny amount of memory, so it isn’t even possible to ship a third party keyboard that provides more accurate on-device speech to text without janky workarounds (requiring the user to open the keyboard's own app first each time).


Have you tried https://handy.computer ?

  • Not bad, it checks almost all the boxes I want. A) Good quality, locally run model, and surprisingly fast and working on my CPU. B) It transcribes after the session is finished (i.e. after releasing push-to-talk, or after stopping the listening). C) Ha nice, post-processing. D) Still not solved: truly realtime transcription with latency hiding, i.e. start typing as soon as you recognize sounds (or after some logical pause, e.g. at the end of a sentence). E) Written in Rust, with a web-browser config UI. F) Global shortcuts are super finicky. It doesn't recognize my default "Mic" button; fair enough, let me remap to some unused F24... It doesn't recognize F24 either, due to a missing keycode.

    It's there, doesn't feel native though. Good integration, not great though (Linux Mint/Cinnamon).

Understood, you want dictation, not a chatbot. That's a valid and different use case.

RCLI is Apple Silicon only today because MetalRT is built on Metal. For Linux, the closest thing to what you're describing would be building a virtual input device on top of Whisper or Parakeet (which RCLI supports as STT backends). Parakeet TDT 0.6B has ~1.9% WER, that's very close to production dictation quality.
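For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words, so ~1.9% means roughly two wrong words per hundred spoken. A minimal sketch of the calculation follows; the function name and example sentences are illustrative only and are not taken from RCLI or any benchmark.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brown box"))
```

Real evaluations usually normalize casing and punctuation before scoring, which this sketch does not do.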

The missing piece on Linux isn't the model, it's the integration: a daemon that captures mic audio, runs STT with hidden latency (streaming partial results), and injects text as keyboard input. sherpa-onnx (https://github.com/k2-fsa/sherpa-onnx) supports Linux and has streaming STT, so it might be the best starting point for what you're after.

We're focused on Apple Silicon for now but broader platform support is on the roadmap.
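To make the "streaming partial results" part of that daemon concrete, here is a rough sketch of one way to hide latency: type each partial hypothesis immediately, and when the recognizer revises it, send only the backspaces and new characters needed to correct what is already on screen. Everything here is hypothetical (`PartialTyper`, `keystrokes_for_update`, the `emit` callback); it is not RCLI or sherpa-onnx API, and a real daemon would wire `emit` to a virtual keyboard device rather than a list.

```python
from os.path import commonprefix


def keystrokes_for_update(typed, hypothesis):
    """Return (backspaces, text_to_type) that turn `typed` into `hypothesis`."""
    # commonprefix compares character by character, so it works on any strings.
    keep = len(commonprefix([typed, hypothesis]))
    return len(typed) - keep, hypothesis[keep:]


class PartialTyper:
    """Tracks what has been typed for the current utterance.

    `emit(backspaces, text)` is an assumed callback that would send the
    keystrokes to whatever input-injection mechanism the daemon uses.
    """

    def __init__(self, emit):
        self.emit = emit
        self.typed = ""

    def on_partial(self, hypothesis):
        backspaces, text = keystrokes_for_update(self.typed, hypothesis)
        if backspaces or text:
            self.emit(backspaces, text)
        self.typed = hypothesis

    def on_final(self, hypothesis):
        # Apply the last correction, then start fresh for the next utterance.
        self.on_partial(hypothesis)
        self.typed = ""


# Simulated recognizer output for one utterance.
events = []
typer = PartialTyper(lambda backspaces, text: events.append((backspaces, text)))
typer.on_partial("hello")
typer.on_partial("hello word")
typer.on_partial("hello world")
typer.on_final("hello world.")
print(events)
```

The trade-off is visible flicker when the model changes its mind about earlier words; committing text only up to a stable prefix would reduce that at the cost of some latency.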

I use voxtype on my Linux machine with parakeet. Super fast and regularly even gets the tech lingo correct. You can configure prompts and keywords to help with that as well.

> I'm not looking for STT->AI->TTS, I'm looking for truly good voice-to-text experience

Umm, ah, wait no, uhh yes you are. Unless, hang on, you are possessed with greater umm speech capabilities than most, wait nevermind start over. Unless you never make a mistake while talking, you want AI to take out the "three, wait no four" and just leave the output with "four" from what you actually spoke. Depending on your use case.
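As a rough illustration of the kind of cleanup meant here, the sketch below strips filler words and collapses a one-word self-correction like "three, wait no four" into "four". It is a deliberately crude regex stand-in for the AI post-processing step, not how any of the tools in this thread do it; the patterns and the `clean_transcript` name are made up for the example.

```python
import re

# Standalone filler words, plus any trailing comma/period and whitespace.
FILLERS = re.compile(r"\b(?:um+|uh+|ah+)\b[,.]?\s*", re.IGNORECASE)

# "<word>, wait no <word>" style repairs. Only explicit cue phrases are
# matched, so plain uses of "no" ("I have no idea") are left alone.
CORRECTION = re.compile(
    r"\b(\w+),?\s+(?:wait,?\s+no|no,?\s+wait|i\s+mean),?\s+(\w+)",
    re.IGNORECASE,
)


def clean_transcript(text):
    """Remove fillers and keep only the corrected word in simple repairs."""
    text = FILLERS.sub("", text)
    text = CORRECTION.sub(r"\2", text)
    return re.sub(r"\s{2,}", " ", text).strip()


print(clean_transcript("um send it to uh three, wait no four people"))
```

Anything longer than a single-word repair ("on Tuesday, wait no, on Wednesday morning") is beyond this heuristic, which is the argument for putting a language model in that slot.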

  • It’s the TTS layer that is weird. I’m in the same boat — speech out is just a much worse modality than text when possible.

    • Agreed for a lot of use cases. RCLI supports text-only mode (--no-speak flag or just type in the TUI instead of using push-to-talk). TTS makes sense for hands-free / eyes-free scenarios, but we don't force it.