Comment by immortal3

1 day ago

There's another angle to this comparison. Groq and Cerebras use custom chips; I'm not sure what Together runs in general, but in this case Together is sharing results measured on NVIDIA's B200 GPU. Another important point is how well these sped-up deployments preserve accuracy relative to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers: https://x.com/Kimi_Moonshot/status/1976926483319763130

> It's known that such tricks reduce accuracy

AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.

  • No, it shouldn't. "All" you're doing is having a small model draft the next few tokens and then having the large model verify them in a single forward pass. Wherever the large model diverges from the small one, you keep the large model's own token and restart drafting from there, so the final output is exactly what the large model would have produced by itself (a minimal sketch follows this thread).

  • It’s quantization which is crippling accuracy…

    • People all over this subthread are saying that with no evidence provided. The company says they don't, which would be pretty embarrassing to have to walk back, so who's saying they do?
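
To make the mechanism described above concrete, here is a minimal greedy sketch of speculative decoding. The model interface is hypothetical (an assumption for illustration, not any provider's API), and real implementations verify against the full probability distributions rather than greedy argmax, but the acceptance logic is the same:

    def speculative_decode(draft_model, target_model, ids, k=4, max_new=64):
        # Assumed interface: model(ids) returns the greedy next-token
        # prediction for every prefix of ids, so model(ids)[-1] is the
        # next token after the full sequence.
        stop = len(ids) + max_new
        while len(ids) < stop:
            # 1. Small model cheaply drafts k tokens, one step at a time.
            draft = []
            for _ in range(k):
                draft.append(draft_model(ids + draft)[-1])
            # 2. Large model checks all k drafted tokens in ONE forward pass.
            preds = target_model(ids + draft)
            # 3. Accept drafted tokens for as long as the large model agrees.
            n = 0
            while n < k and preds[len(ids) + n - 1] == draft[n]:
                n += 1
            ids = ids + draft[:n]
            # 4. On divergence (or full acceptance), take the large model's
            #    own token: the output is token-for-token what greedy
            #    decoding with the large model alone would produce.
            ids = ids + [preds[len(ids) - 1]]
        return ids

The speed-up comes from step 2: verifying k drafted tokens costs roughly one large-model step, so whenever the draft is mostly right you emit several tokens per expensive forward pass, with no change to the output.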

> Groq and Cerebras use custom chips

Not just custom chips, but custom chips that derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl, so it won't get much cheaper any time soon. Some rough numbers below.
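
As a back-of-envelope illustration (the per-chip figures are publicly cited but should be treated as approximate, and the 70 GB model size is a hypothetical: a 70B-parameter model at 8 bits, ignoring the extra memory needed for KV cache and activations):

    # Chips needed just to hold the weights entirely in SRAM.
    model_bytes = 70e9        # hypothetical: 70B params at 8-bit
    groq_sram = 230e6         # ~230 MB SRAM per Groq LPU (cited figure)
    cerebras_sram = 44e9      # ~44 GB SRAM per Cerebras WSE-3 (cited figure)

    print(f"Groq LPUs:       {model_bytes / groq_sram:,.0f}")    # ~304
    print(f"Cerebras wafers: {model_bytes / cerebras_sram:.1f}") # ~1.6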

  • This is an "expensive for whom" question. I'd be keen to know if they're burning investor money hosting these right now or if they're able to run these at cost.