Comment by lisperforlife

2 years ago

Why is this not the top comment? FAIR published their CM3leon paper on decoder-only autoregressive models that work with both text and image tokens. I believe GPT-4o's vocabulary has room for both image and audio tokens. For audio tokens, they probably trained an RVQ-VAE model like EnCodec or SoundStream.
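
For anyone unfamiliar with the RVQ part, here is a toy sketch of residual vector quantization in plain Python. It only illustrates the general idea (each stage quantizes whatever the previous stage left over); the codebooks, the 2-D vectors, and the function names are made up for illustration and are not taken from EnCodec, SoundStream, or anything OpenAI has published.

```python
# Toy residual vector quantization (RVQ). Codebooks are hand-written, not learned.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical 2-D codebooks: a coarse stage followed by a finer one.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

codes = rvq_encode(CODEBOOKS, [1.2, 1.0])   # one integer index per stage
approx = rvq_decode(CODEBOOKS, codes)       # coarse entry plus fine correction
```

The per-stage indices are plain integers, so in principle they could be offset into a reserved range of a model's vocabulary and emitted like any other token. Whether GPT-4o does anything like this is speculation on my part.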
