Comment by boutell

6 hours ago

I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.

From the prompt timings above, it seems like 'prompt eval time' is the equivalent to 'processing time for input tokens'.

Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.

The maximum parallel width becomes equal to the number of layers in the model. Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).

In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.

  • Seven tokens long input isn't very realistic, is it? For coding tasks it's normal for the input to be thousands or 10s of thousands. If it wasn't for prefix caching it'd be one miserable experience, but even then at the very best the input is often in hundreds each time. And don't even try to dump some logs into the prompt.

    • > Seven tokens long input isn't very realistic, is it?

      The test prompt above was "Why is the sky blue?", so there's the seven tokens. I meant to highlight that because I'd expect processing of a thousand-token input to be faster per token than presented.