Comment by lukax

20 hours ago

Or you could use Soniox Real-time (supports 60 languages) which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD.

https://soniox.com/docs/stt/rt/endpoint-detection

Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.

https://www.daily.co/blog/benchmarking-stt-for-voice-agents/

You can try a demo on the home page:

https://soniox.com/

Disclaimer: I used to work for Soniox

Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.

5 comments

lukax

nicktikhonov 20 hours ago

If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.

lukax 19 hours ago
Sorry, I commented too soon. Did you also try Soniox? Why did you decide to use Deepgram's Flux (English only)?
- nicktikhonov 19 hours ago
  
  I didn't try Soniox, but I made a note to check it out! I chose Flux because I was already using Deepgram for STT and just happened to discover it when I was doing research. It would definitely be a good follow-up to try out all the different endpointing solutions to see what would shave off additional latency and feel most natural.
  Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:
  https://research.nvidia.com/labs/adlr/personaplex/
satvikpendem 15 hours ago

I second Soniox as well, as a user. It really does do way better than Deepgram and others. If your app architecture is good enough then maybe replacing providers shouldn't be too hard.

satvikpendem 15 hours ago

I'm using them, how has it been like working there? I see they have some consumer products as well. I wonder how they get state of the art for such low prices over the competition.