Comment by pertymcpert

2 months ago

I have the exact same questions as you. I can barely understand how diffusion works for images; for sequential data like text, it makes no sense to me.

Let’s suppose we have 10k possible tokens in the vocabulary.

Then text would be an image 10k pixels tall and N pixels wide, where N is the length of the text.

For each column, exactly 1 pixel is white (corresponding to the token at that position) and the rest are black.
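
To make the picture concrete, here is a rough sketch of that encoding using NumPy. The helper names `text_to_image` and `image_to_text` and the example token ids are made up for illustration; this is just the one-white-pixel-per-column idea written down, not how any particular model does it.

```python
import numpy as np

def text_to_image(token_ids, vocab_size):
    """Return a (vocab_size, N) array with exactly one 1.0 per column."""
    token_ids = np.asarray(token_ids)
    image = np.zeros((vocab_size, len(token_ids)), dtype=np.float32)
    # Row = token id, column = position in the text.
    image[token_ids, np.arange(len(token_ids))] = 1.0
    return image

def image_to_text(image):
    """Read the text back: the brightest pixel in each column is the token."""
    return image.argmax(axis=0).tolist()

VOCAB_SIZE = 10_000              # the 10k vocabulary from the comment
tokens = [17, 4242, 9999, 17]    # hypothetical token ids, N = 4
image = text_to_image(tokens, VOCAB_SIZE)
```

Reading the text back is just picking the brightest pixel in each column, which still works if the image is a bit noisy.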

Then the diffusion process is the same: repeatedly denoising.
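
A toy version of that loop might look like the sketch below. Big assumptions: `toy_denoiser` is a stand-in (a column-wise softmax) for what would really be a trained network, and the noise schedule `alpha_bars` is arbitrary. It only shows the mechanics of starting from noise and repeatedly denoising, so the tokens it produces are meaningless.

```python
import numpy as np

def add_noise(x0, alpha_bar, rng):
    """Blend a clean image with Gaussian noise; alpha_bar=1.0 means no noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

def toy_denoiser(x_t, temperature=0.1):
    """Stand-in for a trained model: sharpen each column toward one bright pixel."""
    logits = x_t / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=0, keepdims=True)

def sample(vocab_size, length, alpha_bars, rng):
    """Start from pure noise, then repeatedly guess the clean image and re-noise less."""
    x = rng.standard_normal((vocab_size, length))
    for alpha_bar in alpha_bars:          # increasing values: less noise each step
        x0_guess = toy_denoiser(x)
        x = add_noise(x0_guess, alpha_bar, rng)
    # Brightest pixel in each column is the chosen token.
    return toy_denoiser(x).argmax(axis=0)

rng = np.random.default_rng(0)
schedule = np.linspace(0.1, 1.0, 10)      # arbitrary schedule, assumed for the demo
token_ids = sample(vocab_size=50, length=8, alpha_bars=schedule, rng=rng)
```

The point is only that every column gets refined at the same time on each pass, rather than the text being produced left to right.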