Comment by simonw

4 days ago

I believe it's an evolution of the technique used in GPT-Image-1 (or whatever they called that), which was derived from their work on making GPT-4o an "omni" model that can directly output images and audio in addition to text.

The 2024 GPT-4o launch post https://openai.com/index/hello-gpt-4o/ hints at how that works:

"With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."

embedding-shape 4 days ago

Yeah, that's my belief as well, but I haven't seen any concrete explanation of how it works, just the marketing and press releases, sadly.