Comment by tomrod
2 days ago
These are in the same space, but they're diffusion models that map text to image outputs. VLMs are common in the space too, but to my understanding they work in reverse, extracting text from images.
2 days ago
The modern VLMs are more powerful. Instead of invoking text-to-image or image-to-text as a tool, the model is trained as a single multimodal transformer where the latent space between text and image is blurred. So you can say something like “draw me an image with the instructions from this image” and, without any tool calling, it'll read the image, understand the text instructions contained in it, and execute them.
There's no diffusion anywhere; diffusion is kind of dying out except maybe in purpose-built image-editing tools.
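If you want to poke at this yourself, here's a rough sketch of the understanding half of that example using the Hugging Face transformers "image-text-to-text" pipeline (the checkpoint name and image URL below are just placeholders, swap in whatever you like): one model ingests the image and the prompt together and answers based on the text it finds inside the image, with no separate OCR step or tool call. Image generation inside the same model is a separate capability and isn't shown here.

    # Minimal sketch, assuming the Hugging Face `transformers` "image-text-to-text"
    # pipeline and an example VLM checkpoint (placeholders, not a specific recommendation).
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

    # One user turn that interleaves an image with a text prompt. The model reads the
    # instructions written *inside* the image itself -- no OCR step, no tool call.
    messages = [
        {
            "role": "user",
            "content": [
                # Placeholder URL: point this at an image containing written instructions.
                {"type": "image", "url": "https://example.com/handwritten_instructions.png"},
                {"type": "text", "text": "Follow the instructions written in this image."},
            ],
        }
    ]

    result = pipe(text=messages, max_new_tokens=256)
    print(result[0]["generated_text"])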
Ah, thanks for the clarification.