Comment by jmalicki

1 month ago

This is done that way at the GPU layer of abstraction: generally (with some exceptions!) the model lives in GPU VRAM, and you stream the data through it batch by batch.
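A minimal sketch of that pattern, with the GPU simulated in plain Python (none of these function names are a real framework API): the weights are loaded once and stay resident, while the dataset is cut into batches that move through the model one at a time.

```python
# Illustrative only: "resident" model weights, streamed data batches.

def make_model(dim):
    # The "model" is just weights that stay put (on a GPU, this lives in VRAM).
    # Identity matrix here so the behavior is easy to check.
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

def forward(weights, batch):
    # One matrix-vector product per sample; the same weights are reused
    # for every batch that streams through.
    return [
        [sum(w * x for w, x in zip(row, sample)) for row in weights]
        for sample in batch
    ]

def stream_batches(dataset, batch_size):
    # The dataset is too big to keep resident, so it is sliced into
    # batches and fed through the model one batch at a time.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

model = make_model(3)  # loaded once, stays resident
dataset = [[float(i), 0.0, 0.0] for i in range(8)]
outputs = []
for batch in stream_batches(dataset, batch_size=4):
    outputs.extend(forward(model, batch))  # only the batch data moves
```

With identity weights the outputs equal the inputs, which makes the data movement (not the math) the point of the sketch.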

The problem is that larger models barely fit in VRAM, so they definitely don't fit in cache.

Dataflow processors like Cerebras's do stream the data through the model (at least for smaller models, or for portions of larger ones) - each little core has local memory, and you move the data to where it needs to go. To achieve this, though, Cerebras has 96GB of what is basically L1 cache spread among its cores, which is... a lot of SRAM.