Comment by torginus
7 days ago
I wonder if it's feasible to hook up NAND flash with the kind of high-bandwidth link necessary for inference.
Each of these NAND chips has hundreds of flash dies stacked inside, all hooked up to the same data line, so only one die can talk at a time, and they still achieve >1 GB/s of bandwidth. If you could hook them up in parallel, you could get hundreds of GB/s of bandwidth per chip.
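Back-of-the-envelope (the per-die rate and stack height here are just the illustrative figures from above, not a real part spec):

    # Hypothetical aggregate bandwidth if every die in a stacked
    # NAND package could stream in parallel (not how real SSDs work).
    dies_per_package = 100   # assumed stack height
    per_die_gb_s = 1.0       # assumed per-die read bandwidth, GB/s
    print(f"{dies_per_package * per_die_gb_s:.0f} GB/s per package")  # -> 100 GB/s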
NAND is very, very slow relative to RAM, so you'd pay a huge performance penalty there. But maybe more importantly, my impression is that memory contents mutate pretty heavily during inference (you're not just storing the fixed weights), so I'd be pretty concerned about NAND wear. Mutating a single bit on a NAND chip a million times over just results in a large pile of dead NAND chips.
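Rough numbers on the wear point (the P/E-cycle rating is a typical TLC spec, and the rewrite rate is a hypothetical, neither from this thread):

    # Sketch: how quickly sustained rewrites exhaust NAND endurance.
    pe_cycles = 3_000         # assumed TLC program/erase rating per block
    rewrites_per_sec = 1_000  # hypothetical mutation rate during inference
    print(f"block worn out after ~{pe_cycles / rewrites_per_sec:.0f} s")  # ~3 s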
No, it's not slow - a single NAND chip in an SSD offers >1 GB/s of bandwidth. Inside the chip there are 100+ dies actually holding the data, but in SSDs only one of them is active when reading/writing.
You could probably make special NAND chips where all of them can be active at the same time, which means you could get 100+ GB/s of bandwidth out of a single chip.
This would be useless for data storage scenarios, but very useful when you have huge amounts of static data you need to read quickly.
The memory bandwidth of an H100 is ~3 TB/s, for reference. That number is the limiting factor on the size of modern LLMs. 100 GB/s isn't even in the realm of viability.
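For scale, here's the bandwidth-bound ceiling on token rate (the 140 GB figure is an illustrative assumption for a ~70B-parameter model at fp16):

    # Upper bound on tokens/s when decoding is memory-bandwidth bound:
    # at batch size 1, every token streams all weights once.
    weights_gb = 140  # hypothetical 70B params * 2 bytes (fp16)
    for name, bw_gb_s in [("H100 HBM", 3_000), ("parallel NAND", 100)]:
        print(f"{name}: ~{bw_gb_s / weights_gb:.1f} tokens/s")
    # -> H100 HBM: ~21.4 tokens/s, parallel NAND: ~0.7 tokens/s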