Comment by ac29
12 hours ago
> I'm curious what the downside for this speed is here
"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"