Comment by mickeyp
1 hour ago
Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat.
It's prefill; slow prefill kills agentic workloads dead.
If you have 100,000 tokens of prompt at ~150 tok/s prefill per the OP, you're looking at:
You have: 100000 / (150/s)
You want: hms
11 min + 6.6666667 sec
Which is quite a wait indeed.
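For anyone who wants to plug in other numbers, here is the same arithmetic as a small Python sketch. The names `prefill_seconds` and `hms` are made up for illustration, and it assumes the prefill rate stays constant over the whole prompt, which is a simplification. The 100,000 tokens and ~150 tok/s are the figures from above.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Seconds spent processing the prompt before the first output token.

    Assumes a constant prefill rate (a simplification).
    """
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill rate must be positive")
    return prompt_tokens / prefill_tok_per_s


def hms(seconds: float) -> str:
    """Format a duration in seconds as hours, minutes, and seconds."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{int(hours)} h {int(minutes)} min {secs:.1f} s"


# Figures from the comment above: 100,000 prompt tokens at ~150 tok/s.
print(hms(prefill_seconds(100_000, 150)))  # 0 h 11 min 6.7 s
```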
Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for.
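A rough way to see how it adds up over a session: sum the tokens prefilled on each turn and divide by the prefill rate. The per-turn token counts below are invented for illustration, and the sketch assumes each turn only pays for the tokens listed for it; if the runtime has to reprocess the whole context every turn, the total would be higher.

```python
def session_prefill_seconds(tokens_per_turn, prefill_tok_per_s):
    """Total seconds spent in prefill across a session.

    tokens_per_turn: tokens prefilled on each turn (hypothetical values).
    prefill_tok_per_s: prefill rate, assumed constant.
    """
    return sum(tokens_per_turn) / prefill_tok_per_s


# Hypothetical agentic session: file reads, tool output, follow-up prompts.
turns = [8_000, 3_000, 12_000, 5_000, 2_000]
total = session_prefill_seconds(turns, 150.0)  # ~150 tok/s per the OP
print(f"{sum(turns)} tokens prefilled, about {total / 60:.1f} min waiting")
```

With these made-up numbers, 30,000 tokens spread over five turns comes to 200 seconds of waiting before any output is generated.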
This is also a problem for all local LLM setups on Macs. Macs are a great way to get a lot of high-bandwidth memory, but their compute is very far behind current-gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time before it starts generating those tokens.