Comment by sosodev
4 days ago
Huh, you're right. I tried your test and it clearly can't understand the difference between homophones. That seems to imply they're using some sort of speech-to-text (STT) step in front of the model, which is really weird because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost-saving measure?
Weirdly, I just tried it again and it seems to understand the difference between record (the noun) and record (the verb) just fine. Perhaps if there's heavy demand for voice chat, like after a new release, they load shed by transcribing the audio and passing the text to a smaller model.
However, it still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model.
To be fair, discerning heteronyms might just be a gap in its training.