Comment by mlyle

14 days ago

With embedded SRAM close, you get startling amounts of bandwidth -- Cerebras claims to attain >2 bytes/FLOP in practice -- vs H200 attaining more like 0.001-0.002 to the external DRAM. So we're talking about a 3 order of magnitude difference.

Would it be a little better with on-wafer distributed DRAM and sophisticated prefetch? Sure, but it wouldn't match SRAM, and you'd end up with a lot more interconnect and associated logic. And, of course, there's no clear path to run on a leading logic process and embed DRAM cells.

In turn, you batch for inference on H200, where Cerebras can get full performance with very small batch sizes.

0 comments

mlyle

No comments yet

Contribute on Hacker News ↗