
Comment by avidphantasm

5 days ago

Not sure where 40 tokens per second is coming from. I’ve seen 95-100 tokens per second on M5 Max 128GB running Gemma 4 31B. I’ve done experiments where it is faster than Claude Opus 4.5 for the same prompts.

Can you provide your configuration, please?

  • It actually seems to be a bit faster than that now, about 112 tok/sec.

    Configuration:

    Gemma 4 31B Instruct Q6K
    Context size 40960
    LM Studio 0.4.13+1
    Metal llama.cpp v2.14.0
    LM Studio MLX (Apple M5) v1.6.0

    Here are my results:

    prompt eval time = 32545.36 ms / 5625 tokens ( 5.79 ms per token, 172.84 tokens per second)
    eval time = 20227.99 ms / 310 tokens ( 65.25 ms per token, 15.33 tokens per second)
    total time = 52773.35 ms / 5935 tokens

    This was for interacting with a local MCP service, running a tool that returns a ~20KB text file to the agent to add to the chat context.

    I'm seeing about the same number of tokens/second on an M2 Ultra that I have access to (also with 128GB of memory).

    This is surely apples-to-oranges compared with the OP's results (and I don't spend a great deal of time benchmarking these things, so my methodology might be lacking), but it's interesting to see okay performance from a top open model. For most uses, however, I find Gemma 4 26B A4B (Q6K) to be good enough (esp. for MCP calling) and much, much faster (~1,200 tokens/second).
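For anyone comparing numbers, here is a minimal sketch of how the tokens-per-second figures in the timing log above can be recomputed from the raw milliseconds and token counts. The function name `throughput` and the regular expression are illustrative assumptions, not part of llama.cpp or LM Studio; the log text is copied from the comment.

```python
import re

# Matches the "<label> time = <ms> ms / <n> tokens" part of a timing line.
# The pattern is an assumption based on the three lines quoted in the comment.
TIMING = re.compile(
    r"(?P<label>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens"
)


def throughput(log: str) -> dict[str, float]:
    """Return tokens per second for each timing entry found in the log."""
    rates = {}
    for m in TIMING.finditer(log):
        seconds = float(m["ms"]) / 1000.0
        rates[m["label"].strip()] = int(m["tokens"]) / seconds
    return rates


LOG = """
prompt eval time = 32545.36 ms / 5625 tokens
eval time = 20227.99 ms / 310 tokens
total time = 52773.35 ms / 5935 tokens
"""

rates = throughput(LOG)
for label, tps in rates.items():
    print(f"{label}: {tps:.2f} tokens/second")
```

With these inputs the prompt eval and eval rates come out at about 172.84 and 15.33 tokens/second, matching the log. The total line works out to roughly 112.46 tokens/second, so the ~112 tok/sec figure may be the combined rate (prompt processing plus generation) rather than generation speed alone.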