Comment by DoctorOetker
3 days ago
Is there a reason GPUs don't use insane "blocks" of SD-card slots (for massively parallel I/O) so the model weights don't need to pass through a limited PCIe bus?
Yes. Let's do the math. The fastest SD cards can read at around 300 MB/s (https://havecamerawilltravel.com/fastest-sd-cards/). Modern GPUs use 16 lanes of PCIe gen 5, which is 16 × 32 Gb/s = 512 Gb/s = 64 GB/s, meaning you'd need over 200 of the fastest SD cards to match it. So what you're asking is: is there a reason GPUs don't use 200 SD cards? And I can't think of any way that would work.
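The back-of-envelope math above can be sketched like this (the figures are the ones quoted in the comment, not measured values):

```python
# Compare PCIe 5.0 x16 bandwidth against the fastest UHS-II SD cards.
PCIE5_LANE_GBPS = 32      # PCIe gen 5: ~32 Gb/s raw per lane
LANES = 16
SD_UHS2_MB_S = 300        # fastest UHS-II SD card read speed, MB/s

pcie_gbps = PCIE5_LANE_GBPS * LANES       # 512 Gb/s total
pcie_gb_s = pcie_gbps / 8                 # 64 GB/s
cards_needed = pcie_gb_s * 1000 / SD_UHS2_MB_S

print(f"PCIe 5.0 x16: {pcie_gb_s:.0f} GB/s")
print(f"UHS-II SD cards to match it: {cards_needed:.0f}")  # ~213
```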
SD is obviously the wrong interface for this but "High Bandwidth Flash" (stacked flash akin to HBM) is in development for exactly this kind of problem. AMD actually made a GPU with onboard flash maybe a decade ago but I think it was a bit early. Today I would love to have a pool of 50GB/s storage attached to the GPU.
First gen HBF is targeting something like 1.2 TB/s!
One thing to note, those aren't the fastest SD cards, those are the fastest UHS-II SD cards. The future is SD Express and you can already get microSDs at 900 MB/s.
SD Express cards are still on my watch list after seeing this https://www.theverge.com/2021/9/9/22665216/sd-express-card-s.... But even then, that's only 3x faster, so you're still putting down 71-ish cards (my original number was 213.33), which means 71 PCIe PHYs and NVMe stacks.
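Redoing the same estimate with the SD Express figure quoted above (~900 MB/s per card, 3x the UHS-II number):

```python
# Same back-of-envelope estimate, now with SD Express cards.
PCIE5_X16_GB_S = 64       # PCIe 5.0 x16 figure from the earlier comment
SD_EXPRESS_MB_S = 900     # quoted SD Express read speed, MB/s

cards = PCIE5_X16_GB_S * 1000 / SD_EXPRESS_MB_S
print(f"{cards:.1f} cards")  # ~71.1, each needing its own PCIe PHY + NVMe stack
```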
Some years ago I realized that if I had oodles of money to spend I would totally get someone to make a PCIe card with like several hundreds microSD cards on it.
You can buy vertical microSD connectors, so you can stack quite a lot of them on a PCIe card. Then a beefy FPGA to present it as a NVMe device to the host.
Good total capacity too, since you can put 1 TB cards in there. And for teh lulz, of course.
This isn't a very difficult thing to build, but I am curious - what's the point? Who is the market?
The next gen inference chips will use High Bandwidth Flash (HBF) to store model weights.
These are made similarly to HBM but are lower power and much higher capacity. They can also be used for caching to reduce costs when processing long chat sessions.
Maybe latency. IIRC flash is a lot laggier than DRAM and SRAM.
The random-access memory model is not really representative of ML workloads (both training and inference), where multiplying large tensors results in predictable memory access patterns.
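The predictable-access point can be illustrated with a toy blocked matrix-vector multiply (an illustrative sketch, not any real inference runtime): the weights are walked in large, fixed-size, contiguous tiles, so the next read is always known in advance and latency can be hidden by prefetching.

```python
import numpy as np

def blocked_matvec(weights: np.ndarray, x: np.ndarray, tile_rows: int = 4) -> np.ndarray:
    """Compute weights @ x one row-tile at a time.

    Each tile is one big contiguous read, and the sequence of tiles is
    fully predictable -- exactly the access pattern that suits
    high-throughput, high-latency storage like flash.
    """
    out = np.empty(weights.shape[0])
    for start in range(0, weights.shape[0], tile_rows):
        tile = weights[start:start + tile_rows]   # contiguous chunk of weights
        out[start:start + tile_rows] = tile @ x   # consume it, move to the next
    return out

W = np.arange(16, dtype=float).reshape(8, 2)
x = np.ones(2)
print(blocked_matvec(W, x))  # matches W @ x
```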