Comment by cafkafk

9 hours ago

> (purple on black is really hard to read)

Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly I need to redo the themes.

> You say it runs "at reading speed". Have you benchmarked it?

At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:

llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128

Gives:

  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens

So 11.94 tokens per second while it's also playing binary cache and CI builder.

When I do it properly, I'll add it to the blog as well!

12 comments

cafkafk

fhars 4 hours ago

And if you ever run out of things to do in your copious free time, it looks like that PR #1744 was merged without the has_target_ctx assert two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-).

ethbr1 2 hours ago

> two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-)
2010s Javascript, putting down the controller: Ha, no one will ever surpass my high score for wasting programmer time with dependency churn...
2026 Open Source ML: Hold my beer.

bbatha 3 hours ago

What's time to first token? Raw throughput is usually not the problem in local setups in my experience.

anon-3988 7 hours ago

I am pretty sure llamacpp have their own benchmarking binary that you can use.

mft_ 6 hours ago

llama-bench is part of the llama-cpp package, but from recent experimentation, the settings it is able to (or is documented to?) accept lag behind somewhat. Not sure whether it would accept all of the esoteric settings in the article?

ekianjo 7 hours ago

20 tokens per second for eval time is the killer here. It means you can't use this to process any meaningful amount of text.

A GPU typically processes close to 1000 tokens/s during eval.

hnfong 3 hours ago

The prompt is literally "why is the sky blue?" and consists of 7 tokens.
It's probably too small for the timings to be taken seriously.
boutell 6 hours ago
I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.
- Majromax 4 hours ago
  
  From the prompt timings above, it seems like 'prompt eval time' is the equivalent to 'processing time for input tokens'.
  Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.
  The maximum parallel width becomes equal to the number of layers in the model. Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).
  In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.
  
  2 replies →
- ekianjo 3 hours ago
  
  I meant prompt eval time.