Comment by DoctorOetker

3 days ago

Is there a reason GPUs don't use insane "blocks" of SD card slots (for massively parallel I/O) so the model weights don't need to pass through a limited PCIe bus?

Yes. Let's do the math. The fastest SD cards can read at around 300 MB/s (https://havecamerawilltravel.com/fastest-sd-cards/). Modern GPUs use 16 lanes of PCIe gen 5, which is 16 × 32 Gb/s = 512 Gb/s = 64 GB/s, meaning you'd need over 200 of the fastest SD cards just to match the bus. So what you're asking is: is there a reason GPUs don't use 200 SD cards? And I can't think of any way that would work.
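The arithmetic above, written out (numbers taken from the comment; PCIe 5.0 raw lane rate of 32 Gb/s, ignoring encoding and protocol overhead):

```python
# Back-of-the-envelope: how many of the fastest UHS-II SD cards would it
# take to match a PCIe 5.0 x16 link?
sd_card_read = 300e6           # bytes/s, fastest UHS-II SD card (~300 MB/s)
pcie5_lane = 32e9 / 8          # bytes/s per lane: 32 Gb/s raw -> 4 GB/s
pcie5_x16 = 16 * pcie5_lane    # 64 GB/s for a full x16 slot

cards_needed = pcie5_x16 / sd_card_read
print(f"PCIe 5.0 x16: {pcie5_x16 / 1e9:.0f} GB/s")
print(f"SD cards needed to match it: {cards_needed:.0f}")  # ~213
```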

  • SD is obviously the wrong interface for this, but "High Bandwidth Flash" (stacked flash akin to HBM) is in development for exactly this kind of problem. AMD actually made a GPU with onboard flash (the Radeon Pro SSG) maybe a decade ago, but I think it was a bit early. Today I would love to have a pool of 50 GB/s storage attached to the GPU.

  • One thing to note, those aren't the fastest SD cards, those are the fastest UHS-II SD cards. The future is SD Express and you can already get microSDs at 900 MB/s.

  • Some years ago I realized that if I had oodles of money to spend, I would totally get someone to make a PCIe card with several hundred microSD cards on it.

    You can buy vertical microSD connectors, so you can stack quite a lot of them on a PCIe card. Then a beefy FPGA to present it as an NVMe device to the host.

    Goal: total capacity, since you can put 1 TB cards in there. And for teh lulz, of course.
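What that hypothetical card could look like on paper. The card count and per-card speed are assumptions for illustration, and the aggregate ignores whatever the FPGA/controller could actually sustain:

```python
# Hypothetical "several hundred microSDs behind an FPGA" card:
# aggregate capacity and read bandwidth if every card runs in parallel.
num_cards = 400                # assumed number of vertical microSD slots
card_capacity_tb = 1           # 1 TB microSD cards, per the comment
card_read = 300e6              # bytes/s per card, UHS-II-class reads

total_capacity_tb = num_cards * card_capacity_tb
aggregate_bw = num_cards * card_read   # ignores controller bottlenecks

print(f"Capacity: {total_capacity_tb} TB")
print(f"Aggregate read bandwidth: {aggregate_bw / 1e9:.0f} GB/s")  # 120 GB/s
```

On paper that aggregate would outrun a PCIe 5.0 x16 slot, which is exactly why the bottleneck moves to the FPGA and the bus rather than the cards.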

The next-gen inference chips will use High Bandwidth Flash (HBF) to store model weights.

These are made similarly to HBM but are lower power and much higher capacity. They can also be used for caching to reduce costs when processing long chat sessions.

Maybe latency. IIRC flash has much higher latency than DRAM or SRAM.

  • The random-access memory model is not really representative of ML workloads (both training and inference), where multiplying large tensors results in predictable memory access patterns.
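A minimal sketch of why that predictability matters: in a blocked matmul, the order in which weight tiles are consumed is known entirely in advance, so a runtime could prefetch the next tile from slow storage while the current one is being multiplied, hiding flash latency. The function name and tile sizes are illustrative assumptions:

```python
# The tile consumption order of a blocked matmul is a pure function of
# the shapes, so it can be computed ahead of time and used to drive
# prefetch from high-latency storage like flash.
def weight_tile_schedule(n_rows, n_cols, tile):
    """Yield (row, col) weight tiles in the exact order a blocked
    matmul consumes them -- fully deterministic, no data dependence."""
    for i in range(0, n_rows, tile):
        for j in range(0, n_cols, tile):
            yield (i, j)

schedule = list(weight_tile_schedule(8, 8, 4))
print(schedule)  # [(0, 0), (0, 4), (4, 0), (4, 4)]
```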