Comment by embedding-shape
5 days ago
> Also, GPT-Image-2 is not a diffusion model, it is based on Transformers, like other LLMs are.
Where are you getting this from, btw? AFAIK, OpenAI hasn't openly talked about what exactly is powering the Images 2.0 stuff, unless I missed something. I think they've said it's not a diffusion model, but I'm not sure they've said what they're doing instead, have they?
I believe it's an evolution of the technique used in GPT-Image-1 (or whatever they called that), which was derived from their work on making GPT-4o an "omni" model that can directly output images and audio in addition to text.
The 2024 GPT-4o launch post https://openai.com/index/hello-gpt-4o/ hints at how that works:
"With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."
Yeah, that's my belief as well, but I haven't seen any concrete explanation of how it works, just the marketing/press releases, sadly.