Comment by anon373839

21 hours ago

> This is not a local model for any reasonable definition of local

That's true for now. I am hopeful that once the hardware markets have recovered from OpenAI's sabotage, we will see more hardware dedicated to local inference that can handle these big models.

Also, I'm thinking about the unique MoE routing that Apple is using with their new Apple Foundation Model. The model is trained and architected so that experts are not swapped for every token, but only occasionally. This suggests that a future model with, e.g., 744B parameters could have its experts offloaded to SSD and still run with the effective compute requirements of a 40B model.
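A rough way to see why the swap interval matters is to estimate the sustained SSD read rate it implies. The sketch below is back-of-the-envelope arithmetic only: the function name, the 8-bit weight size, the 10 tokens/s target, and the assumption that the whole active set is replaced on each swap are illustrative guesses, not details of Apple's design.

```python
def ssd_read_gb_per_s(active_params_b, bytes_per_param, tokens_per_s, swap_interval_tokens):
    """Sustained SSD read rate (GB/s) needed to keep the active experts resident.

    Assumes (hypothetically) that the entire active set is re-read from SSD
    once every `swap_interval_tokens` generated tokens.
    """
    gb_per_swap = active_params_b * bytes_per_param  # billions of params x bytes/param = GB
    return gb_per_swap * tokens_per_s / swap_interval_tokens


# 40B active parameters at 1 byte each, generating 10 tokens/s:
every_token = ssd_read_gb_per_s(40, 1.0, 10, 1)    # swap on every token
occasional = ssd_read_gb_per_s(40, 1.0, 10, 100)   # swap once per 100 tokens
print(f"swap every token: {every_token:.0f} GB/s; every 100 tokens: {occasional:.0f} GB/s")
```

Under these assumed numbers, swapping on every token is far beyond what a single SSD can deliver, while swapping once per 100 tokens lands in the range of a fast NVMe drive.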

8 comments

timschmidt 15 hours ago

Reading weights out of memory is the definition of a large linear read. I'm a bit mystified that no one has put an embarrassingly parallel flash storage controller next to some tensor processors on a PCIe card. It could have 4 TB of flash hanging off enough channels to saturate SRAM, skipping DRAM entirely, and could even offload prompt processing to a GPU in the same workstation, so long as it got reasonable tokens/s in inference. I'd buy one tomorrow.
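To put a number on "enough channels", here is a minimal sketch of the usual bandwidth bound on decode speed: each generated token has to stream the active weights once, so tokens/s is at most aggregate read bandwidth divided by bytes read per token. The channel count and per-channel throughput are hypothetical placeholders, not figures for any real controller.

```python
def decode_tokens_per_s(channels, gb_per_s_per_channel, active_params_b, bytes_per_param):
    """Upper bound on decode speed when weight streaming is the bottleneck.

    All inputs are illustrative assumptions; compute time and overlap between
    transfers and compute are ignored.
    """
    aggregate_gb_per_s = channels * gb_per_s_per_channel
    gb_per_token = active_params_b * bytes_per_param  # billions of params x bytes/param = GB
    return aggregate_gb_per_s / gb_per_token


# Hypothetical card: 64 flash channels at 2 GB/s each, 40B active params at 1 byte.
print(f"{decode_tokens_per_s(64, 2.0, 40, 1.0):.1f} tokens/s upper bound")
```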

adrian_b 14 hours ago

For the last year, several companies have been developing products that include HBF (high-bandwidth flash memory) as a supplement to HBM, in order to enable running inference for big LLMs at a reasonable cost, e.g. on one GPU-like card.

HBF was initially announced by SanDisk early in 2025; then, early this year, Hynix announced that they had joined SanDisk in producing HBF and that the common specification will be standardized under the Open Compute Project.

With HBF, it would be easy to make a GPU card with 4 TB of HBF, which could run the biggest existing open-weights LLMs in their native unquantized form.

timschmidt 14 hours ago

Exciting news! This is how I see running frontier models at home becoming reasonably affordable. Though it may take a depreciation cycle or two.

zozbot234 12 hours ago

For sparse MoE models, the individual per-layer experts that get selected during inference are actually quite small: single-digit megabytes or so.
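As a sanity check on that order of magnitude, the size of one expert can be estimated from its matrix shapes. This sketch assumes a gated feed-forward expert with three projection matrices (gate, up, down), a common layout but not one confirmed for any model in this thread; the dimensions are made-up examples.

```python
def expert_bytes(d_model, d_expert, bytes_per_param):
    """Approximate size of one gated-FFN expert in one layer.

    Assumes three weight matrices of shape d_model x d_expert
    (gate, up, and down projections) and ignores biases.
    """
    return 3 * d_model * d_expert * bytes_per_param


# Hypothetical shapes: hidden size 2048, expert width 768, 8-bit weights.
size = expert_bytes(2048, 768, 1)
print(f"{size / 1e6:.1f} MB per expert per layer")  # prints "4.7 MB per expert per layer"
```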

tshaddox 16 hours ago

Is there reason to expect the consumer hardware markets to recover any time soon?

Is there reason to expect they’ll ever recover without an AI bust that takes down the U.S. economy?

20after4 15 hours ago

I don't think it'll ever recover. Partially perhaps. But we have bigger problems to worry about really.

zozbot234 21 hours ago

Normally, experts are picked for every layer, not just for every token. But there are plausible ways of getting around that bottleneck while streaming if you can batch many inferences together. Still, the Apple approach of swapping the experts only rarely is interesting, though it likely degrades the model a lot.
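A toy illustration of what "picked for every layer" means for streaming: each layer has its own router, so one token touches a different top-k set of experts at every layer, and the next token may need a mostly different set. The random scores below are a stand-in for real router logits, and the layer and expert counts are arbitrary.

```python
import random


def route_token(num_layers, num_experts, top_k, rng):
    """Return, for one token, the sorted top_k expert ids chosen at each layer."""
    choices = []
    for _ in range(num_layers):
        # Stand-in for router logits; a real router derives these from the hidden state.
        scores = [rng.random() for _ in range(num_experts)]
        top = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:top_k]
        choices.append(sorted(top))
    return choices


def experts_to_load(prev, curr):
    """Count (layer, expert) pairs needed by `curr` that `prev` did not already use."""
    return sum(len(set(c) - set(p)) for p, c in zip(prev, curr))


rng = random.Random(0)
tok_a = route_token(num_layers=4, num_experts=8, top_k=2, rng=rng)
tok_b = route_token(num_layers=4, num_experts=8, top_k=2, rng=rng)
print("token A experts per layer:", tok_a)
print("token B experts per layer:", tok_b)
print("new experts to fetch for B:", experts_to_load(tok_a, tok_b))
```

With independent routing per token, most of the selected experts change from one token to the next, which is the bottleneck that batching or infrequent swapping tries to avoid.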

FridgeSeal 19 hours ago

Just get the bigger models to figure out the architecture required for hot-swappable sub-experts without loss of performance!
Got all those tokens, isn’t that the point of auto research and friends??
(Only sort of joking).