Comment by digiown
15 hours ago
I assume the use case is that you are an inference provider, and you put a bunch of models you might want to serve in the HBF to be able to quickly swap them in and out on demand.
I think the hope is to run directly off of HBF, to eventually replace RAM with it entirely. 1.5TB/s is a pretty solid number! It's not going to be easy; it doesn't just drop in as a replacement (vastly higher latency), but HBF replacing HBM for gobs of bandwidth is the intent, I believe.
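For scale, a rough back-of-envelope on what that bandwidth means for swapping models in and out on demand. The model size, precision, and single-SSD figure below are my own illustrative assumptions, not anything from the article:

```python
# Back-of-envelope: time to stream a full weight set at a given bandwidth.
# All inputs are illustrative assumptions, not vendor specs.

def load_time_seconds(params_billion: float,
                      bytes_per_param: int,
                      bandwidth_tb_per_s: float) -> float:
    """Seconds to read all weights at a sustained sequential bandwidth."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return model_bytes / (bandwidth_tb_per_s * 1e12)

# Assumed example: a 70B-parameter model in fp16 (2 bytes/param).
print(load_time_seconds(70, 2, 1.5))    # ~0.09 s at the quoted 1.5 TB/s
print(load_time_seconds(70, 2, 0.014))  # ~10 s from one fast NVMe SSD (~14 GB/s assumed)
```

That's the gap the swap-on-demand story hinges on: sub-second model swaps instead of tens of seconds.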
Kioxia & Nvidia are already talking about 100M IOPS SSDs directly attached to GPUs. This is less about running the model and more about offloading context for future use, but Nvidia is pushing KV cache to SSD, and using BlueField-4, which has PCIe on it, to attach SSDs and process there. https://blocksandfiles.com/2025/09/15/kioxia-100-million-iop... https://blocksandfiles.com/2026/01/06/nvidia-standardizes-gp... https://developer.nvidia.com/blog/introducing-nvidia-bluefie...
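A minimal sketch of what "KV cache to SSD" can look like at the software level. This is not Nvidia's actual stack, just an illustration of spilling per-request attention cache tensors to NVMe-backed storage and reloading them when the conversation resumes; the cache layout and paths are made up for the example:

```python
# Sketch: spill a transformer KV cache to NVMe-backed storage and restore it.
# Hypothetical layout: one (key, value) tensor pair per layer, per request.
import os
import torch

CACHE_DIR = "/mnt/nvme/kv_cache"  # assumed NVMe mount point

def save_kv_cache(request_id: str,
                  kv_cache: list[tuple[torch.Tensor, torch.Tensor]]) -> None:
    """Move the cache off the GPU and persist it for a paused conversation."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cpu_cache = [(k.cpu(), v.cpu()) for k, v in kv_cache]
    torch.save(cpu_cache, os.path.join(CACHE_DIR, f"{request_id}.pt"))

def load_kv_cache(request_id: str,
                  device: str = "cuda") -> list[tuple[torch.Tensor, torch.Tensor]]:
    """Reload a persisted cache and push it back into GPU memory."""
    cpu_cache = torch.load(os.path.join(CACHE_DIR, f"{request_id}.pt"),
                           map_location="cpu")
    return [(k.to(device), v.to(device)) for k, v in cpu_cache]
```

The win is that a returning user's long context doesn't have to be re-prefilled; it just has to be read back faster than it can be recomputed.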
We've already got DeepSeek running straight off NVMe, with the weights living there. Slowly, but this could maybe scale. https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepsee...
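The simplest version of "weights living on NVMe" is just memory-mapping them so the OS pages tensors in on demand. A conceptual sketch with a made-up file and shape, not the actual DeepSeek-on-NVMe setup or any real runtime's container format:

```python
# Sketch: stream a weight matrix from an NVMe-resident file via memory mapping.
# The path and shape are hypothetical; real runtimes use their own formats.
import numpy as np

# Assume a raw fp16 matrix of shape (8192, 8192) was previously written to NVMe.
weights = np.memmap("/mnt/nvme/layer0_ffn.bin", dtype=np.float16,
                    mode="r", shape=(8192, 8192))

# Only the pages this matmul actually touches are read from flash;
# the OS page cache keeps hot pages in RAM between calls.
x = np.random.rand(1, 8192).astype(np.float16)
y = x @ weights
```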
Kioxia, for example, has AiSAQ, which works in a couple of places such as Milvus; it's not 100% clear to me exactly what's going on there, but it's trying to push work down to the NVMe. And with NVMe 2.1 having computational storage, I expect we'll see more work pushed to the SSD.
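To illustrate the general idea behind flash-resident ANN search (AiSAQ's actual internals may differ): keep only a compressed copy of the vectors in RAM for coarse candidate selection, and fetch full-precision vectors from flash only for re-ranking. A toy sketch with made-up sizes and a naive int8 quantizer:

```python
# Toy sketch of a disk-resident ANN split: compressed vectors in RAM,
# full-precision vectors on flash, fetched only for the final re-rank.
import numpy as np

N, D, TOP_K, RERANK = 100_000, 128, 10, 100
rng = np.random.default_rng(0)

# Full-precision vectors stay on "flash" (an NVMe-backed memmap in this toy).
full = np.memmap("/mnt/nvme/vectors.f32", dtype=np.float32, mode="w+", shape=(N, D))
full[:] = rng.standard_normal((N, D)).astype(np.float32)

# RAM holds only a crude int8-scaled copy, ~4x smaller than fp32.
scale = np.abs(full).max() / 127.0
compressed = (full / scale).astype(np.int8)

def search(query: np.ndarray) -> np.ndarray:
    # Stage 1: coarse scoring against the in-RAM compressed vectors.
    coarse = compressed.astype(np.float32) @ query
    candidates = np.argpartition(-coarse, RERANK)[:RERANK]
    # Stage 2: read only the candidate rows from flash and re-rank exactly.
    exact = full[candidates] @ query
    return candidates[np.argsort(-exact)[:TOP_K]]

print(search(rng.standard_normal(D).astype(np.float32)))
```

Computational storage would push stage 2 (or the whole thing) onto the drive itself instead of shipping candidate rows back to the host.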
These aren't directly the same thing as HBF. A lot of it is caching, but I also tend to think there's an aspiration to move some work out of RAM, not merely to be able to load into RAM faster.
Flash has limited write cycles. The faster you write, the faster it wears out. How do you overcome that?
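For a sense of scale, a back-of-envelope on endurance. Every figure here (P/E cycles, write amplification, write rates, capacity) is my own assumption, not an HBF spec:

```python
# Back-of-envelope flash endurance estimate. All inputs are assumptions.

def years_to_wear_out(capacity_tb: float,
                      pe_cycles: int,
                      write_amplification: float,
                      write_rate_gb_per_s: float) -> float:
    """Years until the rated program/erase budget is exhausted."""
    total_writable_tb = capacity_tb * pe_cycles / write_amplification
    seconds = total_writable_tb * 1e12 / (write_rate_gb_per_s * 1e9)
    return seconds / (365 * 24 * 3600)

# Mostly-read inference weights: occasional model swaps, low sustained writes.
print(years_to_wear_out(capacity_tb=4, pe_cycles=3000,
                        write_amplification=2.0,
                        write_rate_gb_per_s=0.01))  # ~19 years
# Hammering it with sustained 10 GB/s writes (e.g. constant cache spills).
print(years_to_wear_out(capacity_tb=4, pe_cycles=3000,
                        write_amplification=2.0,
                        write_rate_gb_per_s=10.0))  # ~7 days
```

So the answer depends heavily on the workload: weight storage is nearly all reads, while anything write-heavy needs either much higher endurance parts or aggressive wear management.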
They will probably use a simpler, more direct protocol than NVMe.