Comment by lukax
20 hours ago
Or you could use Soniox Real-time (supports 60 languages) which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD.
https://soniox.com/docs/stt/rt/endpoint-detection
Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.
https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
You can try a demo on the home page:
Disclaimer: I used to work for Soniox
Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.
If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.
Sorry, I commented too soon. Did you also try Soniox? Why did you decide to use Deepgram's Flux (English only)?
I didn't try Soniox, but I made a note to check it out! I chose Flux because I was already using Deepgram for STT and just happened to discover it when I was doing research. It would definitely be a good follow-up to try out all the different endpointing solutions to see what would shave off additional latency and feel most natural.
Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:
https://research.nvidia.com/labs/adlr/personaplex/
I second Soniox as well, as a user. It really does do way better than Deepgram and others. If your app architecture is good enough then maybe replacing providers shouldn't be too hard.
I'm using them, how has it been like working there? I see they have some consumer products as well. I wonder how they get state of the art for such low prices over the competition.