Figure 10 in https://arxiv.org/pdf/2506.06276 has a speed comparison. You need fairly large batch sizes for this method to come out ahead. The issue is that the architecture is very sequential, so you need to be generating several images at the same time to make good use of GPU parallelism.
It's a bit more complicated than that and I don't think you're being fair.
StarFlow and the AR models are fixed, but DiT is being compared at different numbers of sampling steps, and we don't really care if we generate garbage at blazing speeds[0]. Go look at... also Figure 10 (lol) from the DiT paper[1], which plots FID against model size and sampling steps. StarFlow appears to be comparing against DiT-XL/2-G. In [1] they run {16, 32, 64, 128, 256, 1024} steps, which corresponds (roughly) to 10k-FID of 60, 35, 25, 22, 21, 20. Translating to StarFlow's graph, we'll guesstimate 21, 23, 50. There's a big difference between 50 and 23, but what might surprise you is that there's also a big difference between 25 and 20. Remember that this metric is lower bounded, and that lower bound is not 0... You also start running into the limitations of the metric the closer you get to its lower bound, which adds another layer of complexity to comparisons[2]
The images from the paper (I believe) are all at 250 steps, which StarFlow is beating at a batch size of 4.
So let's look at batches and invert the data. The y-axis is imgs/sec, so seconds per batch is (1 / <guesstimate of y-value>) * batch.
We get this:
  Batch   DiT    SF
  1       10s    20s
  2       20s    30s
  4       40s    30s
  8       80s    30s
  16      160s   30s
  ...
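To make the inversion concrete, here's a small Python sketch. All throughput numbers are my eyeballed guesstimates from StarFlow's Figure 10, not values reported in either paper:

```python
def seconds_per_batch(batch, imgs_per_sec):
    """Invert throughput: total wall time to produce one batch."""
    return batch / imgs_per_sec

# Eyeballed guesstimates (imgs/sec) from the figure, NOT measured values:
# DiT-XL/2 at 250 steps sits around 0.1 imgs/sec regardless of batch size,
# while StarFlow's throughput grows with batch size (constant ~30s wall time).
dit_throughput = {1: 0.1, 2: 0.1, 4: 0.1, 8: 0.1, 16: 0.1}
sf_throughput = {1: 1 / 20, 2: 2 / 30, 4: 4 / 30, 8: 8 / 30, 16: 16 / 30}

for b in (1, 2, 4, 8, 16):
    print(f"batch {b:>2}: DiT ~{seconds_per_batch(b, dit_throughput[b]):.0f}s, "
          f"StarFlow ~{seconds_per_batch(b, sf_throughput[b]):.0f}s")
```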
So what's happening here is that StarFlow's wall time is (nearly) invariant to batch size, while DiT's scales linearly: DiT gets no advantage from batching. Obviously this won't hold forever, but over this range DiT pays the full per-image cost every time. You could probably make up some of these differences by caching the model, because it looks like there's a turnover from model loading dominating to actual generation dominating, whereas StarFlow hits that turnover already at batch 2.
And batching (even small batches) is going to be pretty common, especially in industry, so the scaling here is a huge win for them. It (roughly) costs you just as much to generate 64 images as it does 2. Worst case, you hand your customers batched outputs and they end up happier, because frankly, generating images is still an iterative process, and good luck getting the thing you want on the first shot even with all your parameters dialed in. So yeah, that makes a much better product.
I'll also add two things: 1) you can get WAY more compression out of Normalizing Flows, and 2) there's just a ton you can do with flows that you can't with diffusion. The explicit density isn't helpful only for the math nerds; it's useful for editing, concept segmentation, interpolation, interpretability, and much more.
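As a toy illustration of what "explicit density" buys you (my own minimal example, nothing to do with StarFlow's actual architecture): even a one-layer affine flow gives you the exact log-likelihood of any point via the change-of-variables formula, something diffusion models can only approximate:

```python
import numpy as np

# Minimal affine normalizing flow: x = s * z + b with z ~ N(0, I).
# Exact log-density via change of variables:
#   log p(x) = log p_base((x - b) / s) - sum(log |s|)
def affine_flow_logpdf(x, s, b):
    z = (x - b) / s
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))  # standard normal base
    return np.sum(log_base) - np.sum(np.log(np.abs(s)))

x = np.array([0.5, -1.2])
s = np.array([2.0, 0.5])
b = np.array([0.1, 0.0])
print(affine_flow_logpdf(x, s, b))  # exact, not a bound or an estimate
```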
[2] Basically, place exponentially growing importance on FID gaps as they shrink, then abandon the metric completely once it stops mattering. As an example, take FFHQ-256 with FID-50k. The image quality difference between 50 and 20 is really not that big, visually. But there's a *HUGE* difference between 10 and 5 — visually, probably as big as the difference between 5 and 3. Once you go below 3, though, you really shouldn't rely on the metric anymore, and comparing a 2.5 model to a 2.7 is difficult.
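For reference, the Fréchet distance underlying FID, sketched here for diagonal covariances (real FID uses full covariance matrices of Inception features; the simplification and function name are mine):

```python
import numpy as np

# Fréchet distance between two Gaussians, diagonal-covariance special case:
#   d^2 = ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2))
# Real FID plugs in full Inception-feature covariances (with a matrix sqrt).
def fid_diag(mu1, var1, mu2, var2):
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1 + var2 - 2 * np.sqrt(var1 * var2))

mu, var = np.zeros(4), np.ones(4)
print(fid_diag(mu, var, mu, var))        # identical distributions -> 0
print(fid_diag(mu, var, mu + 0.5, var))  # mean shift shows up directly
```

Note the hard lower bound at 0 is only reached for identical Gaussians; near the bound, tiny estimation noise in the feature statistics swamps real model differences, which is the point above.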
[0] https://tvgag.com/content/quotes/6004-jpg.jpg
[1] https://arxiv.org/abs/2212.09748