Comment by trhway
5 hours ago
CPU cache is understandably SRAM.
>The whole point of putting memory close is to increase performance and bandwidth, and DRAM is fundamentally latent.
When the access patterns are well established and understood, as in the case of transformers, you can mitigate latency with prefetching (the prefetch pipeline could even be heavily beefed up, knowing that the target is transformers), while putting memory on the same chip gives you a huge number of data lines and thus huge bandwidth.
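Rough sketch of what that prefetch pipeline looks like in software terms (a hypothetical matvec in C standing in for a transformer weight read; __builtin_prefetch is the GCC/Clang hint, and the prefetch distance is a tunable assumption, not a measured value):

    #include <stddef.h>

    /* Latency hiding via software prefetch for a streaming access
       pattern known in advance. PF_LINES is an assumption: how many
       64-byte cache lines ahead to fetch. */
    enum { PF_LINES = 16, FLOATS_PER_LINE = 16 };

    void matvec(const float *w, const float *x, float *y,
                size_t rows, size_t cols)
    {
        for (size_t r = 0; r < rows; r++) {
            const float *row = w + r * cols;
            float acc = 0.0f;
            for (size_t c = 0; c < cols; c++) {
                if (c % FLOATS_PER_LINE == 0)  /* one hint per cache line */
                    __builtin_prefetch(row + c + PF_LINES * FLOATS_PER_LINE,
                                       0 /* read */, 3 /* keep resident */);
                acc += row[c] * x[c];
            }
            y[r] = acc;
        }
    }

Dedicated hardware could of course do far better than a compiler hint, but the principle is the same: the access order is known, so the fetch can be issued far enough ahead to cover DRAM latency.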
With embedded SRAM close by, you get startling amounts of bandwidth -- Cerebras claims to attain >2 bytes/FLOP in practice -- versus an H200 attaining more like 0.001-0.002 bytes/FLOP to its external DRAM. So we're talking about a difference of roughly three orders of magnitude.
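Back-of-the-envelope version of that ratio, treating the spec numbers as approximate assumptions (H200 at ~4.8 TB/s HBM3e and ~3.96 PFLOPS peak FP8):

    #include <stdio.h>

    int main(void)
    {
        double h200_bytes_s = 4.8e12;    /* HBM3e bandwidth, bytes/s */
        double h200_flop_s  = 3.96e15;   /* peak FP8 FLOP/s          */
        double cerebras     = 2.0;       /* claimed bytes/FLOP       */

        double h200 = h200_bytes_s / h200_flop_s;   /* ~0.0012 */
        printf("H200:     %.4f bytes/FLOP\n", h200);
        printf("Cerebras: %.1f bytes/FLOP, ~%.0fx gap\n",
               cerebras, cerebras / h200);
        return 0;
    }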
Would it be a little better with on-wafer distributed DRAM and sophisticated prefetch? Sure, but it wouldn't match SRAM, and you'd end up with a lot more interconnect and associated logic. And, of course, there's no clear path to embedding DRAM cells in a leading-edge logic process.
Bandwidth also dictates batching: on an H200 you have to batch inference heavily to amortize the weight reads from DRAM, whereas Cerebras can get full performance at very small batch sizes.
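The batching arithmetic, roofline-style -- a sketch assuming an N x N fp16 weight matrix and approximate H200 specs (dense FP16 this time):

    #include <stdio.h>

    /* In an N x N fp16 GEMM with batch B, each weight byte read from
       DRAM is reused B times, so arithmetic intensity is roughly
       (2*N*N*B FLOPs) / (2*N*N bytes) = B FLOP/byte. Compute-bound
       therefore needs B >= peak FLOP/s divided by bytes/s. */
    int main(void)
    {
        double bw   = 4.8e12;   /* H200 HBM3e, bytes/s     (approx.) */
        double peak = 9.9e14;   /* H200 dense FP16, FLOP/s (approx.) */

        printf("min compute-bound batch ~ %.0f\n", peak / bw);  /* ~206 */
        /* At >1 byte/FLOP of on-chip SRAM bandwidth, the same bound is
           B >= ~1, which is why tiny batches still run at full speed. */
        return 0;
    }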