Comment by petesergeant
1 day ago
> Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario — 2.65x faster than standard decoding, outperforming even specialized hardware like Groq
and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905
You'll see Groq averaging 1,086 tps vs Together doing 59 tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty well on it too, but still, the difference between a top provider and an also-ran is giant.
God I love OpenRouter.
> I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with on other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code
At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude or GPT-5 or Gemini 2.5 Pro for more complex issues (or ones where I need good usage of the Latvian language), at least for programming tasks. It'd eventually be super cool if they could offer more models.
Or have some sort of partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limit occasionally.
Interesting: if you take a look at the median throughput chart [0], Groq goes insane after Oct 7th. Wonder what happened.
[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance
A 2x jump overnight. New LPU hardware? I checked the speed for Groq's gpt-oss-120b, Llama 4 Maverick, and Llama 4 Scout; none of them had a noticeable change this month.
Heavy quantization
They claim (or rather, someone on Reddit claiming to be staff claims) that's not accurate: https://www.reddit.com/r/LocalLLaMA/comments/1mk4kt0/comment...
There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure about Together. In this case, Together is sharing results based on the B200 GPU. Another important point is the accuracy of these speed-ups compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers. https://x.com/Kimi_Moonshot/status/1976926483319763130
> It's known that such tricks reduce accuracy
AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.
No, it shouldn't. "All" you're doing is having a small model draft the next few tokens and then having the large model verify them. When the large model diverges from the draft, you keep the large model's token and restart drafting from there, so the accepted output is exactly what the large model would have produced on its own.
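For intuition, here's a minimal greedy sketch of that loop (draft_model and target_model are hypothetical next-token callables, not any real API; production implementations verify all k draft tokens in a single batched target pass, which is where the speedup comes from):

```python
def speculative_decode(prompt, draft_model, target_model, k=4, max_new=256):
    """Greedy speculative decoding sketch, for intuition only.

    draft_model / target_model are hypothetical callables mapping a
    token list to the next greedy token.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft: the small model cheaply proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify: accept draft tokens while they match what the
        #    target model would have produced at each position.
        n_accepted = 0
        for i in range(k):
            if target_model(tokens + draft[:i]) == draft[i]:
                n_accepted += 1
            else:
                break
        tokens += draft[:n_accepted]
        # 3. Append the target's own next token (the "restart" point
        #    when the draft diverged), then loop.
        tokens.append(target_model(tokens))
    return tokens
```

Since every kept token is one the target model would have emitted anyway, the output is identical to running the big model alone; speculation trades extra compute for latency, not accuracy.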
It’s quantization which is crippling accuracy…
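The size of that effect is easy to sketch. A toy numpy example of naive per-tensor int8 round-tripping (real deployments use per-channel, calibrated FP8/INT8 schemes, so this overstates the error, but it is never exactly zero):

```python
import numpy as np

# Back-of-envelope: round-trip error from naive symmetric int8
# quantization of a weight matrix. Illustrative only.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)

scale = np.abs(w).max() / 127.0        # one scale for the whole tensor
w_q = np.round(w / scale).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale  # dequantized weights

rel_err = np.linalg.norm(w - w_dq) / np.linalg.norm(w)
print(f"relative weight error: {rel_err:.4%}")
```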
> Groq and Cerebras use custom chips
Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.
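To make the cost point concrete, a back-of-envelope sketch; the per-chip memory figures below are assumptions pulled from public specs, not from this thread:

```python
# How much silicon it takes to keep a big model's weights resident
# in SRAM. Per-chip figures are assumptions from public specs
# (Groq LPU ~230 MB, Cerebras WSE-3 ~44 GB, B200 ~192 GB HBM).
params = 1.0e12          # Kimi-K2 is roughly a 1T-parameter MoE
bytes_per_param = 1      # assume 8-bit weights
weight_bytes = params * bytes_per_param

groq_sram = 230e6        # per LPU (assumption)
cerebras_sram = 44e9     # per WSE-3 wafer (assumption)
b200_hbm = 192e9         # per B200 GPU, HBM rather than SRAM

print(f"Groq LPUs just to hold weights:  {weight_bytes / groq_sram:,.0f}")
print(f"Cerebras wafers to hold weights: {weight_bytes / cerebras_sram:,.0f}")
print(f"B200s to hold weights (in HBM):  {weight_bytes / b200_hbm:,.1f}")
```

Thousands of LPUs versus a handful of GPUs just to hold the weights, before you even get to KV cache; that's the cost asymmetry being paid for the speed.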
This is an "expensive for whom" question. I'd be keen to know if they're burning investor money hosting these right now or if they're able to run these at cost.
> You'll see Groq averaging 1,086tps
What I don't understand is Groq reporting 200 tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
OpenRouter numbers look fishy.
Wonder if it's prompt caching? OpenRouter is (I guess) just reporting actual observed throughput, whereas presumably Groq is reporting a from-scratch figure? Just a guess, though.
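If that guess is right, the gap is easy to reproduce on paper; a toy calculation with invented numbers:

```python
# Made-up numbers, purely to show why two honest "TPS" figures for
# the same deployment can differ by 5x or more.
prefill_time_s = 2.5     # time to process a large prompt from scratch
decode_tokens = 500      # generated tokens
decode_time_s = 0.5      # i.e. ~1,000 tok/s raw generation speed

# "From scratch": completion tokens over total request time.
from_scratch_tps = decode_tokens / (prefill_time_s + decode_time_s)
# Cache hit: prefill is nearly free, so measured throughput shoots up.
cache_hit_tps = decode_tokens / decode_time_s

print(f"from scratch: {from_scratch_tps:.0f} tok/s")  # ~167
print(f"cache hit:    {cache_hit_tps:.0f} tok/s")     # ~1000
```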
Groq is quantizing, even though it's not labeled as such on OpenRouter (super frustrating).
Do you have a source for that? They are pretty close to the ref implementation on moonshot’s ranking
https://groq.com/blog/inside-the-lpu-deconstructing-groq-spe...
But Groq/Cerebras are hardware accelerators. It's an unrelated optimization. I wouldn't be surprised if they could also use speculators (today or in the future).
> Groq and Cerebras often feel like the only games in town.
SambaNova should be in the same league... they've got a similarly specialized hardware approach.
Do these numbers compare performance at the same cost?
You can see the cost in the links, and the answer is “pretty much” for the consumer. The backend maths, no idea.