Comment by spuz
1 year ago
It's not exactly clear whether this is a voice-to-voice model or a voice-to-text-to-voice model. OpenAI claim that their GPT-4o audio model, when it is finally released, will be a lot faster at conversations because there's no delay converting from audio to text and back to audio again. I'm also looking forward to using voice models for language learning.
Full technical write-up here: https://www.daily.co/blog/the-worlds-fastest-voice-bot/
It's a voice-to-text-to-voice approach, as implied by this description:
"host transcription, LLM inference, and voice generation all together in one place"
I think there are some benefits to going through text rather than using a voice-to-voice model. It creates a 100% reliable paper trail of what the model heard and said in the conversation. This can be extremely important in some applications where you need to review and validate what was said.
There is also far more text training data than voice data, and going through text lets you reuse all the benchmarks and tool integrations that have already been developed for LLMs.
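To make the paper-trail point concrete, here's a rough sketch of a voice-to-text-to-voice turn that logs the text at each hop. This is not how Daily's bot is implemented; the class name `CascadedVoiceBot` and the three `fake_*` stages are made up for illustration and stand in for a real transcription service, LLM, and voice generator. Note the log records what the transcriber produced, which is only as accurate as the transcription itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CascadedVoiceBot:
    """Hypothetical cascaded pipeline: audio -> text -> LLM -> text -> audio."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    generate: Callable[[str], str]       # LLM stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage
    transcript: List[Tuple[str, str]] = field(default_factory=list)

    def turn(self, audio_in: bytes) -> bytes:
        heard = self.transcribe(audio_in)
        self.transcript.append(("user", heard))   # what the model "heard"
        reply = self.generate(heard)
        self.transcript.append(("bot", reply))    # what the model "said"
        return self.synthesize(reply)


# Stand-in stages (assumptions): they treat the audio bytes as UTF-8 text
# so the sketch runs without any speech libraries.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


bot = CascadedVoiceBot(fake_stt, fake_llm, fake_tts)
audio_out = bot.turn(b"hello")
```

After a conversation, `bot.transcript` holds every user and bot turn as plain text, which is the part you could review or run existing LLM tooling against. A voice-to-voice model has no such intermediate text unless you transcribe separately.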