Comment by simonw

4 days ago

I believe it's an evolution of the technique used in GPT-Image-1 (or whatever they called that), which was derived from their work on making GPT-4o an "omni" model that can directly output images and audio in addition to text.

The 2024 GPT-4o launch post https://openai.com/index/hello-gpt-4o/ hints at how that works:

"With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."

Yeah, that's my belief as well, but I haven't seen any concrete explanation of how it works, just the marketing and press releases, sadly.