Comment by jauntywundrkind

15 hours ago

I think the hope is to run directly off of HBF, to eventually replace RAM with it entirely. 1.5TB/s is a pretty solid number! It's not going to be easy, it won't just drop in as a replacement (vastly higher latency), but HBF replacing HBM for gobs of bandwidth is the intent, I believe.
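
For a sense of what 1.5TB/s buys you, a quick back-of-envelope (my own illustrative numbers, assuming decode is purely memory-bandwidth-bound):

    # Rough decode-throughput ceiling for a bandwidth-limited model.
    # All numbers are illustrative assumptions, not HBF specs.
    bandwidth = 1.5e12         # bytes/s, the quoted HBF figure
    active_params = 37e9       # e.g. a DeepSeek-style MoE activates ~37B params/token
    bytes_per_param = 1        # assume FP8 weights

    bytes_per_token = active_params * bytes_per_param
    tokens_per_sec = bandwidth / bytes_per_token
    print(f"~{tokens_per_sec:.0f} tokens/s upper bound")  # ~41 tokens/s

That's before latency eats into it, which is exactly the "doesn't just drop in" problem.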

Kioxia & Nvidia are already talking about 100M IOPS SSDs directly attached to GPUs. This is less about running the model & more about offloading context for future use, but Nvidia is pushing KV cache to SSD. And they're using BlueField-4, which has PCIe on it, to attach SSDs and process data there. https://blocksandfiles.com/2025/09/15/kioxia-100-million-iop... https://blocksandfiles.com/2026/01/06/nvidia-standardizes-gp... https://developer.nvidia.com/blog/introducing-nvidia-bluefie...
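
The KV-cache-to-SSD idea in a nutshell: persist a sequence's KV tensors to NVMe so a future request can resume without recomputing prefill. A toy sketch of the concept (my own names and paths, nothing like Nvidia's actual stack):

    # Purely illustrative: spill/restore a session's KV cache on NVMe.
    import numpy as np
    from pathlib import Path

    CACHE_DIR = Path("/mnt/nvme/kv_cache")   # hypothetical mount point

    def save_kv(session_id: str, keys: np.ndarray, values: np.ndarray) -> None:
        # Write once after the session goes idle; read-heavy thereafter.
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        np.savez(CACHE_DIR / f"{session_id}.npz", keys=keys, values=values)

    def load_kv(session_id: str):
        path = CACHE_DIR / f"{session_id}.npz"
        if not path.exists():
            return None                       # cache miss: prefill from scratch
        data = np.load(path)
        return data["keys"], data["values"]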

We've already got DeepSeek running straight off NVMe, weights running there. Slowly, but this could maybe scale. https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepsee...

Kioxia for example has AiSAQ, which works in a couple of places such as Milvus; not 100% clear to me exactly what's going on there, but it's trying to push work to the NVMe. And with NVMe 2.1 having computational storage, I expect we'll see more pushing of work to the SSD.

These aren't directly the same thing as HBF. A lot of this is caching, but also, I tend to think there's an aspiration to move some work out of RAM, not merely to be able to load into RAM faster.

Flash has limited write cycles. The faster you write, the faster it wears out. How do you overcome that?
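
To put rough numbers on that worry (assuming NAND-like endurance ratings; HBF's actual P/E specs aren't public):

    # Why sustained full-rate writes are the scary part. Illustrative only.
    capacity = 1e12          # 1 TB device
    pe_cycles = 3000         # assumed NAND-like program/erase rating
    total_writes = capacity * pe_cycles   # ~3 PB of lifetime writes
    write_rate = 1.5e12      # writing at the full 1.5 TB/s

    lifetime_sec = total_writes / write_rate
    print(f"~{lifetime_sec / 60:.0f} minutes of sustained full-rate writes")  # ~33 min

Which is part of why the plausible uses above skew read-heavy: weights and mostly-immutable KV cache, written once and read many times, plus the usual wear leveling and over-provisioning, rather than treating flash as general-purpose RAM.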