I spent a day (~$100 in API credits) rebuilding the core orchestration loop of a real-time AI voice agent from scratch instead of using an all-in-one SDK. The hard part isn’t STT, LLMs, or TTS in isolation, but turn-taking: detecting when the user starts and stops speaking, cancelling in-flight generation instantly, and pipelining everything to minimize time-to-first-audio.
The write-up covers why VAD alone fails for real turn detection, how voice agents reduce to a minimal speaking/listening loop, why STT → LLM → TTS must be streaming rather than sequential, why TTFT matters more than model quality in voice, and why geography dominates latency. By colocating Twilio, Deepgram, ElevenLabs, and the orchestration layer, I reached ~790ms end-to-end latency, slightly faster than an equivalent Vapi setup.
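The "minimal speaking/listening loop" can be sketched as a two-state machine. This is a sketch under assumed names (`Orchestrator`, `on_user_speech_start`, etc. are illustrative, not the author's code), with the streaming STT/LLM/TTS plumbing stubbed out:

```python
from enum import Enum, auto

class Turn(Enum):
    LISTENING = auto()   # user has the floor; STT streams partial hypotheses
    SPEAKING = auto()    # agent has the floor; LLM tokens stream into TTS

class Orchestrator:
    """Minimal turn-taking loop: the hard part is not any single stage
    but switching states fast and cancelling in-flight work on barge-in."""

    def __init__(self):
        self.state = Turn.LISTENING
        self.cancelled_generations = 0

    def on_user_speech_start(self):
        # Barge-in: the user started talking while the agent was speaking.
        if self.state is Turn.SPEAKING:
            self.cancel_generation()      # abort LLM + TTS streams immediately
            self.state = Turn.LISTENING

    def on_end_of_turn(self, final_transcript: str) -> str:
        # End of turn detected (silence + stable transcript, not VAD alone).
        self.state = Turn.SPEAKING
        return final_transcript           # hand off to streaming LLM -> TTS

    def cancel_generation(self):
        # Placeholder: in a real pipeline this closes/aborts the open
        # LLM and TTS streams so stale audio never reaches the caller.
        self.cancelled_generations += 1
```

The point of keeping the loop this small is that cancellation touches every stage: the moment `on_user_speech_start` fires mid-generation, everything downstream must die before the next audio frame goes out.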
Nice write-up — turn-taking is the whole game.

Two things that bit us building production voice agents:
1) “Barge‑in” feels broken unless you can cancel TTS + LLM immediately (sub‑second) and you treat partial STT hypotheses as first-class signals (not just final transcripts). A simple trick: trigger cancel on any sustained non-silence above a low threshold, then re-enable once you’ve seen N ms of silence.
2) Echo / duplex audio: if you don’t subtract your own TTS audio (or at least gate VAD while TTS is playing), you’ll get false user-starts. Even a crude ‘TTS playing → raise VAD threshold’ helps.
We’re building eboo.ai (voice agents w/ fast barge‑in + streaming orchestration) and ended up with a very similar architecture (telephony + STT + TTS co-located, everything streaming). If you’re curious, happy to compare notes on jitter buffers / geo placement and what’s worked in the wild.