Comment by pertymcpert
2 months ago
I have the exact same questions as you. I can barely understand how diffusion works for images; for sequential data like text, it makes no sense to me.
Let’s suppose we have 10k possible tokens in the vocabulary.
Then text would be an image 10k pixels tall and N pixels wide, where N is the length of the text.
For each column, exactly 1 pixel is white (corresponding to the word which is there) and the rest are black.
Then the diffusion process is the same. Repeatedly denoising.
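The analogy above can be sketched in a few lines of NumPy. This is a toy illustration of the comment's "one-hot image" picture only (tiny vocabulary, plain Gaussian noise), not how production text-diffusion models actually work:

```python
import numpy as np

# Toy vocabulary standing in for the 10k tokens in the comment.
vocab = ["the", "cat", "sat", "on", "mat"]
tokens = [0, 1, 2, 3, 4]  # the sentence "the cat sat on mat"

# Text as an image V pixels tall and N pixels wide:
# exactly one "white" pixel per column.
V, N = len(vocab), len(tokens)
x = np.zeros((V, N))
x[tokens, np.arange(N)] = 1.0

# One forward-diffusion step: add Gaussian noise to every pixel.
rng = np.random.default_rng(0)
noisy = x + 0.1 * rng.standard_normal((V, N))

# Reading tokens back out needs a hard argmax per column -- this
# discrete rounding step is where the image analogy gets strained.
recovered = noisy.argmax(axis=0)
print(recovered.tolist())  # at this low noise level: [0, 1, 2, 3, 4]
```

At small noise the argmax trivially recovers the sentence; the difficulty the thread is circling is what "gradually denoising" means once the signal must snap back to discrete one-hot columns.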
No, that intuition is incorrect.
Denoising models work because a lot of regions turn out to be smooth; you cannot do that "in a discrete way", if that makes sense.
Feel free to give a better explanation. I am not an expert. Clearly denoising models do work on text though.
They may be smooth in embedding space
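A minimal sketch of that idea, with assumed random stand-ins for learned token embeddings: map tokens to points in a continuous space, add ordinary Gaussian noise there (where smoothness holds), and decode by rounding each noisy vector to its nearest token embedding. The embedding matrix and dimensions here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                        # toy vocab size and embedding dim
emb = rng.standard_normal((V, d))  # stand-in for learned embeddings

tokens = np.array([0, 1, 2, 3, 4])
x0 = emb[tokens]                   # clean continuous representation

# Forward process: plain Gaussian noise in embedding space --
# smooth, exactly as in image diffusion.
xt = x0 + 0.2 * rng.standard_normal(x0.shape)

# Decoding: nearest-neighbour "rounding" back to discrete tokens.
dists = ((xt[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)
decoded = dists.argmin(axis=1)
print(decoded.tolist())
```

The discreteness is pushed entirely into the final rounding step, so the denoising itself operates on a smooth space.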