
Comment by akshayKMR

1 day ago

Why not use the Gemini Flash voice API directly instead? Cost? I ask because, from the demo, the tutor's voice seems mechanical. I've played with the Gemini voice API and it's quite impressive for conversation with low latency; I'd say perfect for your use case. It even switches languages if I say "Okay, let's talk in $foo language."

The vocabulary tooling looks neat and well thought out.

Multiple reasons (which also apply to OpenAI's realtime API):

- it's less intelligent than the non-voice APIs
- intelligence degrades even further with lots of context
- it's more expensive
- latency is not a free lunch: lower latency comes at the cost of more interruptions from the tutor, which is really bad UX. We prefer to interrupt less and have higher latency.
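A minimal sketch of the trade-off in that last point, assuming a simple silence-based endpointer (hypothetical names and thresholds, not the project's actual code): the longer the tutor waits for silence before replying, the higher the response latency, but the less likely it is to cut the learner off mid-thought.

```python
# Hypothetical sketch: silence-based turn-taking. The threshold is the
# knob trading latency against interruptions.

def should_respond(ms_since_last_speech: int, silence_threshold_ms: int) -> bool:
    """Return True once the user has been silent long enough to take a turn."""
    return ms_since_last_speech >= silence_threshold_ms

eager = 300     # ms: low latency, but jumps in during thinking pauses
patient = 1200  # ms: higher latency, far fewer interruptions

pause_while_thinking = 600  # user pauses mid-sentence for 600 ms

should_respond(pause_while_thinking, eager)    # eager tutor interrupts here
should_respond(pause_while_thinking, patient)  # patient tutor keeps waiting
```

With the eager threshold the tutor speaks over a user who was only pausing to think; with the patient one it waits the pause out, at the cost of a slower reply when the user really is done.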

Also, we prefer the ElevenLabs voices, though the quality definitely varies. I'm guessing that later this year or next, voice-to-voice models will become good enough, and we'll switch over.