Comment by lisperforlife
9 months ago
Why is this not the top comment? FAIR published their CM3leon paper on decoder-only autoregressive models that work with both text and image tokens. I believe GPT-4o's vocabulary has room for both image and audio tokens. For the audio tokens, they probably trained an RVQ-VAE (residual vector quantization) model like EnCodec or SoundStream.
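For anyone unfamiliar with RVQ, here is a rough sketch of the idea: each stage quantizes whatever residual the previous stage left behind, so one audio frame becomes a short list of codebook indices, and those indices can be shifted into a shared vocabulary alongside text tokens. Everything specific here is an assumption for illustration (random codebooks instead of trained ones, the sizes, the `audio_offset` value, the helper names); it is not how GPT-4o, EnCodec, or SoundStream are actually implemented.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; return one codebook index per stage."""
    residual = np.asarray(x, dtype=np.float64)
    indices = []
    for cb in codebooks:  # each cb has shape (codebook_size, dim)
        # Pick the code closest to what is still unexplained.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

def to_shared_vocab(indices, codebook_size, audio_offset):
    """Hypothetical mapping: give each (stage, index) pair its own id
    in a vocabulary whose first `audio_offset` ids belong to text."""
    return [audio_offset + stage * codebook_size + i
            for stage, i in enumerate(indices)]

# Toy setup: random codebooks stand in for trained ones (assumption).
rng = np.random.default_rng(0)
dim, codebook_size, num_stages = 8, 16, 4
codebooks = [rng.normal(size=(codebook_size, dim)) / (2 ** s)
             for s in range(num_stages)]

frame = rng.normal(size=dim)          # stand-in for one encoded audio frame
codes = rvq_encode(frame, codebooks)  # one index per stage
recon = rvq_decode(codes, codebooks)
token_ids = to_shared_vocab(codes, codebook_size, audio_offset=100_000)
```

In a real codec the codebooks are learned jointly with an encoder and decoder, and how the per-stage codes get flattened or interleaved into a token sequence is a design choice that differs between systems.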