Comment by anonzzzies

10 hours ago

From this thread [0], can I assume that because the model, while 1.6T total, is A49B (49B active), it could theoretically run locally on consumer hardware, even if very slowly? Or is that wrong?

[0] https://news.ycombinator.com/item?id=47864835

3 comments

alecco 8 hours ago

If a 5090 has 32GB, and we assume a 1-bit quantization is somehow possible and that you need no VRAM for anything else (forget the KV cache, etc.), it could fit at most a 256B-parameter 1-bit model. That extreme case shows how unlikely it is for a 1.6T model.
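To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. It counts weights only (no KV cache, activations, or runtime overhead) and treats 1 GB as 10^9 bytes; the function names are made up for illustration.

```python
def max_params_billions(vram_gb: float, bits_per_param: float) -> float:
    """Upper bound on parameters (in billions) that fit in VRAM, weights only."""
    total_bits = vram_gb * 1e9 * 8
    return total_bits / bits_per_param / 1e9


def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Size of the weights alone, in GB, at a given quantization."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # 32 GB card at a hypothetical 1 bit per parameter
    print(max_params_billions(32, 1))  # 256.0
    # A 1.6T-parameter model at 1 bit per parameter
    print(weights_gb(1600, 1))         # 200.0
```

So even at an imaginary 1 bit per parameter, the 1.6T weights alone come to about 200 GB, several times what a 32GB card holds.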

And the active parameters come from the experts. For each token, the model picks some experts to run the forward pass (usually 2 to 4; I haven't read V4's papers). It's not always the same experts, so you can't just keep 49B parameters loaded and ignore the rest.
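A toy sketch of that per-token expert selection, assuming plain top-k gating with a softmax over the selected experts. This is a generic mixture-of-experts illustration, not V4's actual router; the logits and the helper name `top_k_route` are invented for the example.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs, best expert first.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected experts only (shifted for numerical stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    # Made-up router scores for two tokens over four experts.
    token_a = [0.1, 2.0, -1.0, 1.5]
    token_b = [3.0, 0.2, 2.5, -0.5]
    print([i for i, _ in top_k_route(token_a, k=2)])  # [1, 3]
    print([i for i, _ in top_k_route(token_b, k=2)])  # [0, 2]
```

The two tokens land on different experts, which is why the full set of expert weights has to stay reachable even though only a fraction is used per token.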

OTOH, this being DeepSeek, I foresee a bunch of distilled V4 models in FP8 that fit on a 5090 with tiny batches and reach roughly 75 to 85% of V4's performance. That might be good enough for many everyday tasks.

Today is a good day for open models. Thank god for DeepSeek.

Quasimarion 9 hours ago

Theoretically, with streaming, any model that fits on disk can run on consumer hardware, just terribly slowly.

imrebuild 4 hours ago

It will be Seconds Per Token instead of Tokens Per Second.
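A rough estimate of what "seconds per token" could look like when streaming from disk. It assumes the worst case where every active weight is read from storage for each token (nothing cached in RAM or VRAM) and ignores compute time. The 7 GB/s read speed is an assumed figure for a fast NVMe drive, not a measurement, and the 49B active count is taken from the "A49B" in the question.

```python
def seconds_per_token(active_params_billions: float,
                      bits_per_param: float,
                      read_gb_per_s: float) -> float:
    """Lower bound on time per token if all active weights stream from disk."""
    gb_read_per_token = active_params_billions * bits_per_param / 8
    return gb_read_per_token / read_gb_per_s


if __name__ == "__main__":
    # 49B active parameters, FP8 (8 bits), assumed 7 GB/s sequential reads
    print(seconds_per_token(49, 8, 7))  # 7.0
    # Same model at a hypothetical 4-bit quantization
    print(seconds_per_token(49, 4, 7))  # 3.5
```

Under those assumptions it really is several seconds per token. Caching the experts that get reused would help, but by how much depends on routing patterns this sketch does not model.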