Comment by d4rkp4ttern

16 days ago

This is not strictly speech-to-speech, but I quite like it when working with Claude Code or other CLI Agents:

STT: Handy [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate what it understood, and it gives back a nicely structured version -- this helps confirm understanding and likely also helps the CLI agent stay on track.

TTS: Pocket-TTS [2], just 100M params, and amazing speech quality (English only). I made a voice plugin [3] based on this for Claude Code, so it can speak short updates whenever CC stops. It uses a non-blocking stop hook that calls a headless agent to create a 1-2 sentence summary. Turns out to be surprisingly useful. It's also fun, as you can customize the speaking style, have it mirror your vibe, etc.

The voice plugin provides commands to control it:

    /voice:speak stop
    /voice:speak azelma (change the voice)
    /voice:speak <your arbitrary prompt to control the style or other aspects>
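A rough sketch of how a non-blocking stop hook like the one described above might be structured. This is not the plugin's actual code. It assumes the hook is a script that receives a JSON payload on stdin; the `last_message` field, the `build_prompt` and `spawn_detached` helpers, and the worker command (which would run the headless agent and hand its summary to the TTS) are all hypothetical names used for illustration.

```python
import json
import subprocess
import sys


def build_prompt(hook_input: dict, max_sentences: int = 2) -> str:
    """Build the summary request for the headless agent.

    Assumption: "last_message" is a hypothetical field name, not a
    documented part of any hook payload.
    """
    text = str(hook_input.get("last_message", ""))[-2000:]  # cap the size
    return (
        f"Summarize this update in at most {max_sentences} "
        f"short spoken sentences:\n{text}"
    )


def spawn_detached(cmd: list[str]) -> subprocess.Popen:
    """Start cmd in the background and return without waiting for it."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: detach from the hook's session
    )


def main(worker_cmd: list[str], stream=sys.stdin) -> int:
    """Read the hook payload, hand the slow work to a worker, return at once.

    worker_cmd is a hypothetical command that takes the prompt as its last
    argument, asks a headless agent for the summary, and speaks the result.
    """
    hook_input = json.load(stream)
    spawn_detached(worker_cmd + [build_prompt(hook_input)])
    return 0
```

The point of the design is that the hook itself does nothing slow: summarizing and speaking happen in a separate process, so the CLI agent is never held up waiting for audio.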

[1] Handy https://github.com/cjpais/Handy

[2] Pocket-TTS https://github.com/kyutai-labs/pocket-tts

[3] Voice plugin for Claude Code: https://github.com/pchalasani/claude-code-tools?tab=readme-o...

12 comments

skrebbel 16 days ago

Wow Handy works impressively well! Excellent UX too (on Windows at least).

raajg 9 days ago

I've been dabbling with STT quite a bit and built my own tool using Deepgram. But just tried Handy and it's SO FREAKING FAST! Love it.

d4rkp4ttern 9 days ago

Yes, especially with Parakeet V3. It’s also nicely hackable: I Clauded a couple of PRs to improve the experience, like removing stutters and filler words.

freakynit 11 days ago

A 25MB TTS model: https://github.com/kittenml/kittentts

d4rkp4ttern 11 days ago

Nice, I’ll have to try it out. They should really make a uv-installable CLI tool like Pocket-TTS did. People underestimate just how much more immediately usable something becomes when you can get it with a simple “uv tool install …”
- freakynit 11 days ago
  
  True that. People, especially developers, underestimate the importance of packaging. Or, in general, making it easier for others to use your product.
- d4rkp4ttern 9 days ago
  
  So I benchmarked it, and there’s really no advantage over Pocket-TTS. There are also some tradeoffs: for example, Kitten doesn’t have streaming audio.

indigodaddy 16 days ago

Hi, I'm looking for an STT setup that can run on a server via cron, using a small local model (the server is CPU-only: 4 Threadripper vCPUs and 20 GB of RAM), and that can transcribe from remote audio URLs. The URL part is a preference; I know local models probably don't have this feature, so I'd have to do something like curl the audio down to memory or /tmp, transcribe it, and then remove the file.

Have any thoughts?
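The download, transcribe, and remove steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `transcribe_url` is a hypothetical helper, and the `transcribe` argument stands in for whichever local STT model ends up being used (nothing here calls a real speech library).

```python
import os
import shutil
import tempfile
import urllib.request
from typing import Callable


def transcribe_url(
    url: str,
    transcribe: Callable[[str], str],
    suffix: str = ".wav",
    timeout: float = 60.0,
) -> str:
    """Download audio to a temp file, transcribe it, then delete the file.

    `transcribe` is a placeholder: any function that takes a local file
    path and returns text.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Stream the response to disk so large files never sit in memory.
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(
            url, timeout=timeout
        ) as resp:
            shutil.copyfileobj(resp, out)
        return transcribe(path)
    finally:
        # Runs on success and on failure, so nothing is left behind.
        os.remove(path)
```

A cron job could call this once per URL. Wrapping the model behind a plain function keeps the cleanup logic independent of the STT backend.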

d4rkp4ttern 15 days ago
I’ve no thoughts on that unfortunately.
- indigodaddy 15 days ago
  
  :)

3dsnano 16 days ago

posts like this are why i visit HN daily!!!

thanks for sharing your knowledge; can’t wait to try out your voice plugin

d4rkp4ttern 15 days ago

Same!
Feel free to file a GitHub issue if you have problems with the voice plugin.