Comment by SwellJoe

14 hours ago

A trillion parameters is wild. That's not going to quantize to anything normal folks can run. Even at 1-bit, it's going to be bigger than what a Strix Halo or DGX Spark can run. Though I guess streaming from system RAM and disk makes it feasible to run it locally at <1 token per second, or whatever. GLM 5.1, at 754B parameters, is already beyond any reasonable self-hosting hardware (1-bit quantization is 206GB). Maybe a Mac Studio with 512GB can run them at very low-bit quantizations, also pretty slowly.
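For anyone who wants to sanity-check those sizes, here is a rough back-of-the-envelope sketch. It assumes the weight footprint is simply parameters times bits per weight, and it ignores KV cache, activations, and runtime overhead. The function names are made up for illustration.

```python
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Weights only, in decimal GB; ignores KV cache and runtime overhead."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(size_gb: float, params: float) -> float:
    """Average bits per weight implied by a given file size."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    # Parameter counts are the ones quoted in the comment above.
    for name, params in [("1T model", 1e12), ("754B model", 754e9)]:
        for bits in (1, 2, 3.6, 4, 8):
            print(f"{name} @ {bits}-bit: {weight_size_gb(params, bits):.0f} GB")
    # What average bit width would a 206 GB file of a 754B model imply?
    print(f"{effective_bits_per_weight(206, 754e9):.2f} bits/weight")
```

By this arithmetic a literal 1-bit quant of a 1T model is about 125GB of weights alone, which is right at the 128GB those boxes top out at (as far as I know) before any KV cache or OS. The 206GB figure for the 754B model works out to roughly 2.2 bits per weight on average, which suggests "1-bit" quants of that size are not uniformly 1-bit.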

3 comments


justinclift 8 hours ago

Looks like it. This quant ( https://huggingface.co/inferencerlabs/Kimi-K2.6-MLX-3.6bit ) says:

> Q3.6 typically achieves useable accuracy in our coding test and fits within a 512GB memory budget

This one ( https://huggingface.co/mlx-community/Kimi-K2.6-MoE-Smart-Qua... ), though, says it fits on a 192GB Mac:

> M3/M4 Ultra 192GB+ (fits in ~150GB)

jauntywundrkind 13 hours ago

A huge dual-socket Epyc system used to be able to get to 1TB without difficulty: 16 DIMMs of 64GB each. Doable for ~$3000, with considerable memory bandwidth.

Our hope these days seems to be that maybe, possibly, High Bandwidth Flash works out. Instead of 4, 8, or maybe more channels on some of the highest-end drives, you would have many dozens of channels of flash.

Ideally that can sit very, very near the inference. PCIe 7.0 is about 0.5TB/s at x16 (counting both directions), which is obviously nowhere remotely near enough throughput here. The difficulty is that NAND has been trying to be super dense, so as you scale channels you would normally tend to scale NAND capacity too, and now instead of a 2TB drive you have a 200TB drive priced way beyond consumer means. Still, I think HBF is perhaps the only shot at the most important thing in computing going from mainframe back to consumer, and of course the models are going to balloon again if this does hit, probably before consumers ever get a chance.
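To put a number on the PCIe point, here is a rough sketch of raw link bandwidth. It assumes PCIe 7.0's 128 GT/s per lane and treats each transfer as one bit per lane, ignoring encoding and protocol overhead, so real throughput would be somewhat lower. The 450GB size is a hypothetical example (a 1T-parameter model at 3.6 bits per weight), and the function names are made up.

```python
def pcie_raw_gbs(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s (one bit per transfer per lane,
    no encoding or protocol overhead subtracted)."""
    return gt_per_s * lanes / 8


def seconds_to_stream(size_gb: float, bandwidth_gbs: float) -> float:
    """Time to move size_gb across a link once at the given bandwidth."""
    return size_gb / bandwidth_gbs


if __name__ == "__main__":
    one_way = pcie_raw_gbs(128, 16)          # PCIe 7.0, x16
    print(f"x16 one direction: {one_way:.0f} GB/s")
    print(f"x16 both directions: {2 * one_way:.0f} GB/s")
    # Hypothetical 450 GB of weights, read once over the link.
    print(f"one full pass: {seconds_to_stream(450, one_way):.2f} s")
```

That is about 256GB/s each way, or roughly 0.5TB/s if you count both directions. A single full pass over 450GB of weights would take well over a second, so anything that has to re-read most of the model per token stays under 1 token per second on that link. An MoE only touches a subset of the weights per token, so the real picture depends on how much has to be streamed.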

segmondy 4 hours ago

You can't buy 16 64GB DIMMs for $3000. Go shop memory prices again. But yes, an old Epyc can run this with no GPU at reasonable speed, and if you throw in a few GPUs you can get very manageable speed. I run this at home on an old PCIe 4 system with slow 2400MHz DDR4 RAM and still get about 13 tok/s.
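A rough way to see why CPU-only decoding on an MoE can be usable at all: token generation is mostly memory-bandwidth bound, so a ceiling is peak bandwidth divided by the bytes of weights read per token. The sketch below is only that ceiling. The channel count, active-parameter count, and bit width plugged in are placeholders for illustration, not the actual setup or model configuration described above.

```python
def ddr_peak_gbs(mt_per_s: float, channels: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channels by default)."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def decode_ceiling_tok_s(bandwidth_gbs: float, active_params: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Placeholder assumptions: DDR4-2400, 8 or 16 populated channels,
    # 32B active parameters per token, 4 bits per weight.
    for channels in (8, 16):
        bw = ddr_peak_gbs(2400, channels)
        tok_s = decode_ceiling_tok_s(bw, 32e9, 4)
        print(f"{channels} channels: {bw:.1f} GB/s peak, <= {tok_s:.1f} tok/s")
```

Real numbers land below the ceiling because of NUMA effects, attention and KV-cache traffic, and imperfect bandwidth utilization, and above it if GPUs hold part of the model. The point is only that the active parameter count, not the total, sets the per-token read.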