Prefill is 400 t/s on that hardware. It's just that if the prompt is very short you can't see the real speed, and it will default to single-token context processing.
Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:
prefill: 121.76 t/s, generation: 47.85 t/s
The main target seems to be Apple's Metal, so that makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.
I don't want to be a jerk but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k tokens of context and you're sitting there for 5+ minutes before the first token is generated.
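To make the back-of-the-envelope explicit (a rough sketch using the 31 t/s figure and assuming no cached prefix):

    # rough time-to-first-token at a given prefill rate
    prompt_tokens = 10_000
    prefill_tps = 31
    ttft = prompt_tokens / prefill_tps               # ~323 seconds
    print(f"~{ttft / 60:.1f} minutes before the first token")   # ~5.4 minutes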
Healthy!
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
That prefill number isn't right. An M4 Max hits 200-300 t/s: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...
Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test prompt processing (PP).
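One way to measure prompt processing properly is to send a deliberately long prompt with streaming enabled and divide its token count by the time to the first streamed token. A minimal sketch against a local OpenAI-compatible endpoint (the localhost:8080 URL, the model name, and the chars/4 token estimate are all assumptions):

    import time, requests

    URL = "http://localhost:8080/v1/chat/completions"   # assumed local llama-server endpoint
    long_prompt = "Summarize this:\n" + ("lorem ipsum dolor sit amet " * 2000)

    payload = {"model": "local", "stream": True, "max_tokens": 16,
               "messages": [{"role": "user", "content": long_prompt}]}

    start = time.time()
    with requests.post(URL, json=payload, stream=True) as r:
        for line in r.iter_lines():
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                ttft = time.time() - start   # first streamed chunk marks the end of prefill
                break

    est_prompt_tokens = len(long_prompt) / 4  # crude estimate; use the real tokenizer if available
    print(f"TTFT {ttft:.1f}s -> ~{est_prompt_tokens / ttft:.0f} t/s prompt processing")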
If it's just the coding agent system prompt and tools, you can cache that.
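A rough sketch of what that buys, assuming the fixed system prompt plus tool definitions dominate the context and the server can reuse their KV cache across calls (the token counts are made up for illustration):

    prefill_tps = 31
    cached_prefix = 8_000    # system prompt + tool definitions, prefilled once and reused
    new_tokens = 2_000       # fresh per-turn context that still needs prefill
    cold = (cached_prefix + new_tokens) / prefill_tps   # ~323 s on a cold cache
    warm = new_tokens / prefill_tps                     # ~65 s once the prefix is cached
    print(f"cold: {cold/60:.1f} min, warm: {warm/60:.1f} min to first token")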
What are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?