Comment by regularfry

12 hours ago

This is a different model with, confusingly, approximately the same number of params as the existing gemma4 MoE. It's unclear from a quick scan whether one was somehow trained from the other.

The mechanism isn't the same as speculative decoding. Speculative decoding is sequential and (usually) proposes a couple of tokens at a time; diffusion isn't sequential, and generates whole blocks of text at once. I haven't read the collateral yet, but my assumption would be that it's trained to keep the specific experts stable across a diffusion block.
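To make the difference in control flow concrete, here's a toy sketch. It has nothing to do with the actual model: `target_next`, `draft_next` and `denoise` are made-up stand-ins, the "tokens" are just small integers, and the "commit the most confident half each step" schedule is an assumption for illustration. The speculative loop drafts a few tokens and then checks them left to right (a real implementation does that check in one forward pass of the big model); the block loop starts from a fully masked block and fills positions in parallel over a handful of refinement steps.

```python
from typing import List, Tuple

MASK = -1  # placeholder for a position that has not been decoded yet


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model: deterministic next token."""
    return (sum(prefix) + len(prefix)) % 7


def draft_next(prefix: List[int]) -> int:
    """Hypothetical cheap draft model: wrong whenever len(prefix) % 3 == 0."""
    guess = target_next(prefix)
    return (guess + 1) % 7 if len(prefix) % 3 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int = 2) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model proposes k tokens, one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify left to right. Matching tokens are accepted; the first
        #    mismatch is replaced by the target's token and the rest dropped.
        for tok in draft:
            if len(out) == goal:
                break
            expected = target_next(out)
            out.append(expected)
            if tok != expected:
                break
    return out


def denoise(prefix: List[int], block: List[int]) -> List[Tuple[int, int]]:
    """Hypothetical denoiser: a (token, confidence) guess for every position at once."""
    base = sum(prefix)
    n = len(block)
    return [((base + i) % 7, (i * 5) % n) for i in range(n)]


def block_decode(prompt: List[int], block_size: int = 8) -> Tuple[List[int], int]:
    """Fill a whole masked block over a few parallel refinement steps."""
    block = [MASK] * block_size
    steps = 0
    while MASK in block:
        proposals = denoise(prompt, block)
        masked = [i for i, t in enumerate(block) if t == MASK]
        # Commit the most confident half of the still-masked positions.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[: (len(masked) + 1) // 2]:
            block[i] = proposals[i][0]
        steps += 1
    return list(prompt) + block, steps
```

With greedy verification the speculative loop produces exactly what `greedy_decode` would, just in fewer target-model rounds when the draft is right. The block loop never walks left to right at all: positions get committed in confidence order, and the number of steps grows much more slowly than the block size.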


bachmeier 12 hours ago

Thanks. I found this other comment that links to a very thorough explanation: https://news.ycombinator.com/item?id=48479042