Comment by petesergeant
1 day ago
> Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario — 2.65x faster than standard decoding, outperforming even specialized hardware like Groq
and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905
You'll see Groq averaging 1,086 tps vs Together doing 59 tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty well on it too, but still, the difference between a top provider and an also-ran is giant.
God I love OpenRouter.
> I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with on other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code
At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude or GPT-5 or Gemini 2.5 Pro for more complex issues (or ones where I need good usage of the Latvian language), at least for programming tasks. It'd eventually be super cool if they could offer more models.
Or have some sort of partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limit occasionally.
Interesting: if you take a look at the median throughput chart [0], Groq goes insane after Oct 7th. Wonder what happened.
[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance
A 2x jump overnight. New LPU hardware? I checked the speed for Groq's gpt-oss-120b, Llama 4 Maverick, and Llama 4 Scout; none of them had a noticeable change this month.
Heavy quantization
They claim (or rather, someone on Reddit claiming to be staff claims) that's not accurate: https://www.reddit.com/r/LocalLLaMA/comments/1mk4kt0/comment...
There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure about Together. In this case, Together is sharing results based on the B200 GPU. Another important point is the accuracy of these speed-ups compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers. https://x.com/Kimi_Moonshot/status/1976926483319763130
> It's known that such tricks reduce accuracy
AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.
No, it shouldn't. "All" you're doing is having a small model draft the next few tokens and then having the large model verify them. When the large model diverges from the draft, you keep the large model's token and restart drafting from there, so the accepted output is exactly what the large model would have produced on its own.
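For intuition, here's a minimal greedy sketch of that loop (draft_model and target_model are hypothetical next-token callables, not any real API; production implementations verify all k draft tokens in a single batched target pass, which is where the speedup comes from):

```python
def speculative_decode(prompt, draft_model, target_model, k=4, max_new=256):
    """Greedy speculative decoding sketch, for intuition only.

    draft_model / target_model are hypothetical callables mapping a
    token list to the next greedy token.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft: the small model cheaply proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify: accept draft tokens while they match what the
        #    target model would have produced at each position.
        n_accepted = 0
        for i in range(k):
            if target_model(tokens + draft[:i]) == draft[i]:
                n_accepted += 1
            else:
                break
        tokens += draft[:n_accepted]
        # 3. Append the target's own next token (the "restart" point
        #    when the draft diverged), then loop.
        tokens.append(target_model(tokens))
    return tokens
```

Since every kept token is one the target model would have emitted anyway, the output is identical to running the big model alone; speculation trades extra compute for latency, not accuracy.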
It’s quantization which is crippling accuracy…
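The size of that effect is easy to sketch. A toy numpy example of naive per-tensor int8 round-tripping (real deployments use per-channel, calibrated FP8/INT8 schemes, so this overstates the error, but it is never exactly zero):

```python
import numpy as np

# Back-of-envelope: round-trip error from naive symmetric int8
# quantization of a weight matrix. Illustrative only.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)

scale = np.abs(w).max() / 127.0        # one scale for the whole tensor
w_q = np.round(w / scale).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale  # dequantized weights

rel_err = np.linalg.norm(w - w_dq) / np.linalg.norm(w)
print(f"relative weight error: {rel_err:.4%}")
```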
> Groq and Cerebras use custom chips
Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.
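To make the cost point concrete, a back-of-envelope sketch; the per-chip memory figures below are assumptions pulled from public specs, not from this thread:

```python
# How much silicon it takes to keep a big model's weights resident
# in SRAM. Per-chip figures are assumptions from public specs
# (Groq LPU ~230 MB, Cerebras WSE-3 ~44 GB, B200 ~192 GB HBM).
params = 1.0e12          # Kimi-K2 is roughly a 1T-parameter MoE
bytes_per_param = 1      # assume 8-bit weights
weight_bytes = params * bytes_per_param

groq_sram = 230e6        # per LPU (assumption)
cerebras_sram = 44e9     # per WSE-3 wafer (assumption)
b200_hbm = 192e9         # per B200 GPU, HBM rather than SRAM

print(f"Groq LPUs just to hold weights:  {weight_bytes / groq_sram:,.0f}")
print(f"Cerebras wafers to hold weights: {weight_bytes / cerebras_sram:,.0f}")
print(f"B200s to hold weights (in HBM):  {weight_bytes / b200_hbm:,.1f}")
```

Thousands of LPUs versus a handful of GPUs just to hold the weights, before you even get to KV cache; that's the cost asymmetry being paid for the speed.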
This is an "expensive for whom" question. I'd be keen to know if they're burning investor money hosting these right now or if they're able to run these at cost.
> You'll see Groq averaging 1,086tps
What I don't understand is Groq reporting 200 tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
OpenRouter numbers look fishy.
Wonder if it's prompt caching? OpenRouter is (I guess) just reporting actual observed throughput, whereas presumably Groq is reporting a from-scratch figure? Just a guess, though.
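If that guess is right, the gap is easy to reproduce on paper; a toy calculation with invented numbers:

```python
# Made-up numbers, purely to show why two honest "TPS" figures for
# the same deployment can differ by 5x or more.
prefill_time_s = 2.5     # time to process a large prompt from scratch
decode_tokens = 500      # generated tokens
decode_time_s = 0.5      # i.e. ~1,000 tok/s raw generation speed

# "From scratch": completion tokens over total request time.
from_scratch_tps = decode_tokens / (prefill_time_s + decode_time_s)
# Cache hit: prefill is nearly free, so measured throughput shoots up.
cache_hit_tps = decode_tokens / decode_time_s

print(f"from scratch: {from_scratch_tps:.0f} tok/s")  # ~167
print(f"cache hit:    {cache_hit_tps:.0f} tok/s")     # ~1000
```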
Groq is quantizing, even though it's not labeled as such on OpenRouter (super frustrating).
Do you have a source for that? They are pretty close to the ref implementation on moonshot’s ranking
https://groq.com/blog/inside-the-lpu-deconstructing-groq-spe...
But Groq/Cerebras are hardware accelerators. It's an unrelated optimization. I wouldn't be surprised if they could also use speculators (today or in the future).
> Groq and Cerebras often feel like the only games in town.
SambaNova should be in the same league... they've got a similarly specialized hardware approach.
Do these numbers compare performance at the same cost?
You can see the cost in the links, and the answer is “pretty much” for the consumer. The backend maths, no idea.