Comment by janalsncm
2 months ago
Let’s suppose we have 10k possible tokens in the vocabulary.
Then text would be an image 10k pixels tall and N pixels wide, where N is the length of the text.
For each column, exactly one pixel is white (the one corresponding to the token at that position) and the rest are black.
Then the diffusion process is the same: repeated denoising.
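Here's a rough sketch of that framing in code, just to make it concrete. The vocab size, token ids, corruption rule, and the placeholder denoiser are all made up for illustration; a real discrete diffusion model would learn the reverse step.

    import numpy as np

    VOCAB_SIZE = 10_000                      # 10k possible tokens
    tokens = np.array([17, 4521, 9, 300])    # made-up token ids for a 4-token text
    N = len(tokens)

    # The VOCAB_SIZE x N "image": exactly one white pixel (1.0) per column.
    image = np.zeros((VOCAB_SIZE, N), dtype=np.float32)
    image[tokens, np.arange(N)] = 1.0

    rng = np.random.default_rng(0)

    # Forward ("noising") step: with probability p, replace a column with a
    # uniform distribution over the vocabulary, a discrete stand-in for noise.
    def corrupt(img, p=0.5):
        out = img.copy()
        for col in range(img.shape[1]):
            if rng.random() < p:
                out[:, col] = 1.0 / img.shape[0]
        return out

    # Reverse step: a trained model would score each column; here `model` is
    # just a hypothetical callable returning logits of shape (VOCAB_SIZE, N).
    def denoise_step(img, model):
        scores = model(img)
        out = np.zeros_like(img)
        out[np.argmax(scores, axis=0), np.arange(img.shape[1])] = 1.0
        return out

    noisy = corrupt(image)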
No, that intuition is incorrect.
Denoising models work because a lot of regions turn out to be smooth; you cannot do that "in a discrete way", if that makes sense.
Feel free to give a better explanation. I am not an expert. Clearly denoising models do work on text though.
This one's closer to the thing.
https://news.ycombinator.com/item?id=44059646
They may be smooth in embedding space.
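A toy example of what "smooth in embedding space" buys you (the embedding table, dimensions, and noise level here are invented): Gaussian noise and interpolation are well defined on the continuous vectors, and you round back to the nearest token at the end.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB_SIZE, DIM = 10_000, 64
    embedding = rng.normal(size=(VOCAB_SIZE, DIM))   # stand-in for a learned table

    tokens = np.array([17, 4521, 9, 300])
    x = embedding[tokens]                            # (N, DIM) continuous vectors

    # Gaussian noise is meaningful on these vectors, unlike on raw token ids.
    noisy = x + 0.1 * rng.normal(size=x.shape)

    # After (hypothetical) denoising, map each vector back to its nearest token.
    dists = ((noisy[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=-1)
    recovered = dists.argmin(axis=1)
    print(recovered)   # matches `tokens` when the noise is small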