Comment by sosodev

5 days ago

Huh, you're right. I tried your test and it clearly can't understand the difference between homophones. That seems to imply they're using some sort of TTS mechanism. Which is really weird because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost saving measure?

Weirdly, I just tried it again and it seems to understand the difference between record and record just fine. Perhaps if there's heavy demand for voice chat, like after a new release, they load shed by using TTS to a smaller model.

However, It still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model.