Comment by spuz

1 year ago

Is there anyone besides OpenAI working on a speech to speech model? I find it incredibly useful and it's the sole reason that I pay for their service but I do find it very limited. I'd be interested to know if any other groups are doing research on voice models.

Yes. Kyutai released an open model called Moshi: https://github.com/kyutai-labs/moshi

There's also LLaMA-Omni and a few others. None of them are even close to 4o from an LLM standpoint. But Moshi is billed as a "foundational" model and I'm hopeful it will be enhanced. Also, there's not yet support for these on most backends like llama.cpp / Ollama etc. So I'd say we're in a trough, but we'll get there.

There’s Ultravox as well (from one of the creators of WebRTC): https://github.com/fixie-ai/ultravox

Their model builds a speech-to-speech layer into Llama. Last I checked they have the audio-in part working and they’re working on the audio-out piece.

When I asked advanced voice mode, it said that it receives input as audio and generates text as output.

  • It is mistaken because it has no particular insight into its own implementation. In fact the whole point is that it directly consumes and produces audio tokens with no text. That's why it's able to sing, make noises, do accents, and so on.
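A toy sketch of that last point, using made-up data structures (not any real model's API): once a cascaded pipeline's ASR step reduces audio to text, prosody is gone before the LLM ever sees it, whereas a model operating directly on audio tokens keeps that information in the token stream, which is what makes singing and accents possible.

```python
# Toy illustration of cascaded (ASR -> LLM -> TTS) vs. direct audio-token
# modeling. Each "audio frame" is a (word, pitch) pair; pitch stands in
# for prosody, accent, singing, etc. All names here are hypothetical.

audio_in = [("hello", "high"), ("there", "rising")]

def asr(frames):
    """Cascaded step 1: transcription keeps only the words."""
    return [word for word, _pitch in frames]

def audio_tokenize(frames):
    """Direct approach: codec-style tokens encode words AND pitch."""
    return [f"{word}|{pitch}" for word, pitch in frames]

cascaded = asr(audio_in)           # pitch is discarded before the LLM sees it
direct = audio_tokenize(audio_in)  # pitch survives into the token stream

print(cascaded)  # ['hello', 'there']
print(direct)    # ['hello|high', 'there|rising']
```

The cascaded list has no way to represent a sung "hello"; the token list does, which is the gist of why a text bottleneck rules out those behaviors.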