Comment by red2awn
5 days ago
This is a stack of models:
- 650M Audio Encoder
- 540M Vision Encoder
- 30B-A3B LLM
- 3B-A0.3B Audio LLM
- 80M Transformer / 200M ConvNet (audio tokens to waveform)
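To put rough numbers on the stack, here is a minimal sketch that tallies the sizes listed above. It assumes the "30B-A3B" style names mean total parameters followed by parameters active per token (the usual mixture-of-experts naming), that the encoders and the waveform stage are dense, and that the 80M Transformer and 200M ConvNet are both part of the stack rather than alternatives. The component labels are just illustrative.

```python
# Sizes in billions of parameters, copied from the list above.
# Each entry: (label, total params, params active per token).
# Assumption: "30B-A3B" = 30B total, 3B active; everything else is dense.
COMPONENTS = [
    ("Audio Encoder", 0.65, 0.65),
    ("Vision Encoder", 0.54, 0.54),
    ("LLM (30B-A3B)", 30.0, 3.0),
    ("Audio LLM (3B-A0.3B)", 3.0, 0.3),
    ("Transformer (audio tokens to waveform)", 0.08, 0.08),
    ("ConvNet (audio tokens to waveform)", 0.2, 0.2),
]


def tally(components):
    """Return (total, active) parameter counts in billions."""
    total = sum(t for _, t, _ in components)
    active = sum(a for _, _, a in components)
    return total, active


total_b, active_b = tally(COMPONENTS)
print(f"total ~ {total_b:.2f}B, active per token ~ {active_b:.2f}B")
```

Under those assumptions the whole stack is about 34.5B parameters that have to be held in memory, with roughly 4.8B of them active per token.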
This is a closed-weight update to their Qwen3-Omni model. They previously had an open-weight release, Qwen/Qwen3-Omni-30B-A3B-Instruct, and a closed version, Qwen3-Omni-Flash.
You basically can't use this model right now, since none of the open-source inference frameworks have it fully implemented. It works on transformers, but it's extremely slow.