Comment by thomasahle
2 months ago
> Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop here. Let's train a model to recover texts with 30%, 50%, 90%, 100% of masked tokens.
> Once you've trained that, in order to generate something from scratch, you start by feeding the model all [MASK]s. It will generate you mostly gibberish, but you can take some tokens (let's say, 10%) at random positions and assume that these tokens are generated ("final")
This is clearly wrong. If you actually froze 10% of gibberish tokens, your output would be terrible!
What you actually do in discrete state-space diffusion (see e.g. [1]) is to allow every token to change at every time step.
You combine this with a "schedule" that allows the model to know how close it is to being done. E.g. at t=0/20 the changes will be large, and at t=19/20 only small refinements are made.
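In pseudocode, the sampling loop looks roughly like this (a rough PyTorch sketch, assuming a hypothetical `model(tokens, t)` that returns per-position logits over the vocabulary and a `mask_id` token; the actual sampler in [1] is derived from the forward corruption process, this only shows the structure):

```python
import torch

def sample_discrete_diffusion(model, seq_len, mask_id, num_steps=20):
    # Start from the fully-noised state: all [MASK].
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for t in range(num_steps):
        # The model is told how far along we are (the "schedule"), so
        # early steps make big changes and late steps only refine.
        logits = model(tokens, t)                  # (seq_len, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        # Every position is resampled -- nothing is frozen.
        tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return tokens
```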
Update: There is actually a kind of model that "greedily" freezes the top-p most confident tokens, similar to what the blog post describes (though not at random!). This is called MaskGit [2], but it is not a diffusion model and doesn't work as well.
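A sketch of that MaskGit-style decoding, again assuming a hypothetical `model(tokens)` returning per-position logits; the paper uses a cosine schedule for how many tokens stay masked per step, and this is illustrative rather than the reference implementation:

```python
import math
import torch

def maskgit_decode(model, seq_len, mask_id, num_steps=8):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(num_steps):
        logits = model(tokens)                     # (seq_len, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1) # greedy guess per position

        masked = tokens == mask_id
        # Cosine schedule: how many positions stay masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_keep_masked = int(frac * seq_len)

        # Freeze the most confident predictions among the still-masked slots
        # for good; everything else stays [MASK] for the next round.
        confidence = confidence.masked_fill(~masked, -float("inf"))
        num_to_fill = int(masked.sum()) - num_keep_masked
        if num_to_fill > 0:
            top = torch.topk(confidence, num_to_fill).indices
            tokens[top] = prediction[top]
    return tokens
```

Once a token is placed it is never revisited, which is exactly the difference from the diffusion sampler above.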
Btw, you can also just use "continuous diffusion" with a transformer/bert model, where you've removed the top softmax layer. Then everything works as normal with Gaussian noise, and you just do softmax at the final time step.
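Very hand-wavy sketch of that, assuming a hypothetical `model(x_t, t)` (a transformer without the softmax head) trained to predict clean token embeddings, and `embed` being the token-embedding table; real systems use more careful noise schedules and samplers:

```python
import torch

def sample_continuous_diffusion(model, embed, seq_len, num_steps=50):
    dim = embed.shape[1]
    x = torch.randn(seq_len, dim)          # pure Gaussian noise per position
    for step in range(num_steps, 0, -1):
        t = step / num_steps
        x0_hat = model(x, t)               # predicted clean embeddings
        if step > 1:
            # Re-noise the prediction down to the next (lower) noise level.
            sigma = (step - 1) / num_steps
            x = x0_hat + sigma * torch.randn_like(x0_hat)
        else:
            x = x0_hat
    # Only at the very end map back to discrete tokens: similarity to the
    # embedding table gives logits, then softmax/argmax.
    logits = x @ embed.T                   # (seq_len, vocab_size)
    return torch.softmax(logits, dim=-1).argmax(dim=-1)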