Comment by avidphantasm

4 days ago

It actually seems a bit faster than that now: about 112 tok/sec overall (prompt processing plus generation).

Configuration:

Gemma 4 31B Instruct Q6K
Context size 40960
LM Studio 0.4.13+1
Metal llama.cpp v2.14.0
LM Studio MLX (Apple M5) v1.6.0

Here are my results:

prompt eval time = 32545.36 ms / 5625 tokens ( 5.79 ms per token, 172.84 tokens per second)
eval time = 20227.99 ms / 310 tokens ( 65.25 ms per token, 15.33 tokens per second)
total time = 52773.35 ms / 5935 tokens
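For anyone who wants to check how these figures relate, here is a quick sketch that recomputes the rates from the raw timings. The helper name `tokens_per_second` is made up for illustration and is not part of llama.cpp or LM Studio; the numbers are copied from the timing output above.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/second from a token count and elapsed milliseconds."""
    return tokens / (elapsed_ms / 1000.0)


# Figures copied from the timing output above (hypothetical helper, real numbers).
prompt_tps = tokens_per_second(5625, 32545.36)  # prompt processing
eval_tps = tokens_per_second(310, 20227.99)     # token generation
total_tps = tokens_per_second(5935, 52773.35)   # both phases combined

print(f"prompt: {prompt_tps:.2f} tok/s")
print(f"eval:   {eval_tps:.2f} tok/s")
print(f"total:  {total_tps:.2f} tok/s")
```

This gives roughly 172.84, 15.33, and 112.46 tok/s. The combined figure is pulled up by the large prompt (5625 of the 5935 tokens), so generation alone is closer to 15 tok/s.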

This was for interacting with a local MCP service, running a tool that returns a ~20KB text file to the agent to add to the chat context.

I'm seeing about the same number of tokens/second on an M2 Ultra that I have access to (also with 128GB of memory).

This is surely an apples-to-oranges comparison with the OP's results (and I don't spend much time benchmarking these things, so my methodology may be lacking), but it's interesting to see okay performance from a top open model. For most uses, however, I find Gemma 4 26B A4B (Q6K) good enough (especially for MCP calling) and much, much faster (~1,200 tokens/second).