Comment by mlyle
6 hours ago
With embedded SRAM close to the compute, you get startling amounts of bandwidth -- Cerebras claims to attain >2 bytes/FLOP in practice -- vs H200 attaining more like 0.001-0.002 to the external DRAM. So we're talking about a difference of three orders of magnitude.
Would it be a little better with on-wafer distributed DRAM and sophisticated prefetch? Sure, but it wouldn't match SRAM, and you'd end up with a lot more interconnect and associated logic. And, of course, there's no clear path to run on a leading logic process and embed DRAM cells.
As a result, you have to batch for inference on H200 so that each weight fetched from DRAM is reused across many requests, whereas Cerebras can get full performance with very small batch sizes.
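To put rough numbers on the batching point, here is a back-of-the-envelope sketch. The bytes/FLOP figures are the ones quoted above; everything else is an assumption for illustration: 2-byte (fp16/bf16) weights, 2 FLOPs (one multiply-add) per weight per sequence, and weight traffic as the only memory cost (KV cache and activations are ignored). The function name is made up for this example.

```python
import math

# Figures quoted in the comment above.
SRAM_BYTES_PER_FLOP = 2.0             # Cerebras claim, on-wafer SRAM
DRAM_BYTES_PER_FLOP = (0.001, 0.002)  # H200 to external DRAM, rough range


def min_batch_to_be_compute_bound(supplied_bytes_per_flop,
                                  bytes_per_weight=2.0,
                                  flops_per_weight=2.0):
    """Smallest batch at which weight reads stop being the bottleneck.

    Assumption (not from the comment): each weight is read once per step
    and does `flops_per_weight` FLOPs for every sequence in the batch, so
    the bytes/FLOP the hardware must supply shrinks as 1/batch.
    """
    needed_at_batch_1 = bytes_per_weight / flops_per_weight
    # round() guards against float noise pushing ceil() up by one.
    ratio = round(needed_at_batch_1 / supplied_bytes_per_flop, 9)
    return max(1, math.ceil(ratio))


if __name__ == "__main__":
    for dram in DRAM_BYTES_PER_FLOP:
        gap = SRAM_BYTES_PER_FLOP / dram
        batch = min_batch_to_be_compute_bound(dram)
        print(f"DRAM at {dram} B/FLOP: SRAM is ~{gap:.0f}x higher, "
              f"batch >= {batch} needed to hide weight reads")
    print("SRAM:", min_batch_to_be_compute_bound(SRAM_BYTES_PER_FLOP))
```

Under those assumptions the DRAM case needs a batch in the hundreds to low thousands before the ALUs stay busy, while the SRAM case is already compute-bound at batch 1. Real numbers will differ with quantization, KV cache traffic, and kernel efficiency.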