Comment by DoctorOetker

3 days ago

Is there a reason GPUs don't use insane "blocks" of SD card slots (for massively parallel I/O) so the model weights don't need to pass through a limited PCIe bus?

Yes. Let's do the math. The fastest SD cards can read at around 300 MB/s (https://havecamerawilltravel.com/fastest-sd-cards/). Modern GPUs use 16 lanes of PCIe gen 5, which is 16 × 32 Gb/s = 512 Gb/s = 64 GB/s, meaning you'd need over 200 of the fastest SD cards just to match the bus. So what you're asking is: is there a reason GPUs don't use 200 SD cards? And I can't think of any way that would work.
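The arithmetic above, written out (numbers taken from the comment; PCIe 5.0 raw lane rate of 32 Gb/s, ignoring encoding and protocol overhead):

```python
# Back-of-the-envelope: how many of the fastest UHS-II SD cards would it
# take to match a PCIe 5.0 x16 link?
sd_card_read = 300e6           # bytes/s, fastest UHS-II SD card (~300 MB/s)
pcie5_lane = 32e9 / 8          # bytes/s per lane: 32 Gb/s raw -> 4 GB/s
pcie5_x16 = 16 * pcie5_lane    # 64 GB/s for a full x16 slot

cards_needed = pcie5_x16 / sd_card_read
print(f"PCIe 5.0 x16: {pcie5_x16 / 1e9:.0f} GB/s")
print(f"SD cards needed to match it: {cards_needed:.0f}")  # ~213
```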

  • SD is obviously the wrong interface for this, but "High Bandwidth Flash" (stacked flash akin to HBM) is in development for exactly this kind of problem. AMD actually made a GPU with onboard flash (the Radeon Pro SSG) maybe a decade ago, but I think it was a bit early. Today I would love to have a pool of 50 GB/s storage attached to the GPU.

  • One thing to note, those aren't the fastest SD cards, those are the fastest UHS-II SD cards. The future is SD Express and you can already get microSDs at 900 MB/s.

  • Some years ago I realized that if I had oodles of money to spend, I would totally get someone to make a PCIe card with several hundred microSD cards on it.

    You can buy vertical microSD connectors, so you can stack quite a lot of them on a PCIe card. Then a beefy FPGA to present it as an NVMe device to the host.

    Goal: total capacity, since you can put 1 TB cards in there. And for teh lulz, of course.
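What that hypothetical card could look like on paper. The card count and per-card speed are assumptions for illustration, and the aggregate ignores whatever the FPGA/controller could actually sustain:

```python
# Hypothetical "several hundred microSDs behind an FPGA" card:
# aggregate capacity and read bandwidth if every card runs in parallel.
num_cards = 400                # assumed number of vertical microSD slots
card_capacity_tb = 1           # 1 TB microSD cards, per the comment
card_read = 300e6              # bytes/s per card, UHS-II-class reads

total_capacity_tb = num_cards * card_capacity_tb
aggregate_bw = num_cards * card_read   # ignores controller bottlenecks

print(f"Capacity: {total_capacity_tb} TB")
print(f"Aggregate read bandwidth: {aggregate_bw / 1e9:.0f} GB/s")  # 120 GB/s
```

On paper that aggregate would outrun a PCIe 5.0 x16 slot, which is exactly why the bottleneck moves to the FPGA and the bus rather than the cards.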

The next-gen inference chips will use High Bandwidth Flash (HBF) to store model weights.

These are made similarly to HBM but are lower power and much higher capacity. They can also be used for caching to reduce costs when processing long chat sessions.

Maybe latency. IIRC flash has much higher latency than DRAM or SRAM.

  • The random-access memory model is not really representative of ML workloads (both training and inference), where multiplying large tensors results in predictable memory access patterns.
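A minimal sketch of why that predictability matters: in a blocked matmul, the order in which weight tiles are consumed is known entirely in advance, so a runtime could prefetch the next tile from slow storage while the current one is being multiplied, hiding flash latency. The function name and tile sizes are illustrative assumptions:

```python
# The tile consumption order of a blocked matmul is a pure function of
# the shapes, so it can be computed ahead of time and used to drive
# prefetch from high-latency storage like flash.
def weight_tile_schedule(n_rows, n_cols, tile):
    """Yield (row, col) weight tiles in the exact order a blocked
    matmul consumes them -- fully deterministic, no data dependence."""
    for i in range(0, n_rows, tile):
        for j in range(0, n_cols, tile):
            yield (i, j)

schedule = list(weight_tile_schedule(8, 8, 4))
print(schedule)  # [(0, 0), (0, 4), (4, 0), (4, 4)]
```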