Comment by walrus01
9 hours ago
People thinking of self-hosting Kimi K2.6 had better be prepared for how big it is.
The Q8_K_XL quantization, for instance, is around 600GB on disk; I'd estimate roughly 700GB of VRAM to run it.
Quantizations below Q8 are probably not worth it for quality.
Or 2.05TB on disk for the full precision GGUF.
https://huggingface.co/unsloth/Kimi-K2.6-GGUF
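Rough arithmetic behind those file sizes, assuming a ~1T-parameter model (the parameter count is my assumption based on earlier K2 releases, not a confirmed K2.6 figure):

```python
# Back-of-envelope GGUF size estimate for a ~1T-parameter model.
# PARAMS is an assumption based on earlier Kimi K2 releases.
PARAMS = 1.0e12

def gguf_size_gb(bits_per_weight, overhead=1.05):
    """Approximate file size: params * bits / 8, plus ~5% for scales,
    metadata, and mixed-precision layers (a rough assumption)."""
    return PARAMS * bits_per_weight / 8 * overhead / 1e9

print(f"16-bit full precision: ~{gguf_size_gb(16):.0f} GB")   # ~2100 GB, i.e. ~2.1 TB
print(f"~4.8 bits/weight avg:  ~{gguf_size_gb(4.8):.0f} GB")  # ~630 GB, the Q8_K_XL ballpark
```

The 16-bit figure lands right on the 2.05TB full-precision GGUF, and a mixed quant averaging roughly 5 bits/weight explains the ~600GB file.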
If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.
While most people would not be able to run Kimi K2.6 fast enough for chat, for a coding assistant low speed matters much less, especially when many tasks can be batched to make progress during a single pass over the weights.
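A sketch of why batching amortizes the slow single-user speed: decode is memory-bandwidth-bound, and one streaming pass over the active weights can produce a token for every sequence in the batch. The bandwidth and active-weight numbers below are illustrative assumptions, not measured K2.6 figures, and MoE routing means active weights grow somewhat with batch size in practice:

```python
# Decode throughput model for a memory-bandwidth-bound LLM.
# Each decode step streams the active weights once; that single pass
# yields one token per sequence in the batch.
def decode_tokens_per_sec(mem_bandwidth_gbs, active_weight_gb, batch_size):
    steps_per_sec = mem_bandwidth_gbs / active_weight_gb  # weight passes per second
    return steps_per_sec * batch_size                     # one token/sequence per pass

# Assumed numbers: ~30 GB of active (MoE) weights per token,
# ~2 TB/s aggregate memory bandwidth.
print(decode_tokens_per_sec(2000, 30, batch_size=1))   # ~67 tok/s total
print(decode_tokens_per_sec(2000, 30, batch_size=16))  # ~1067 tok/s total
```

Per-sequence speed stays the same, but aggregate throughput scales with batch size until compute or KV-cache memory becomes the bottleneck.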
If you run it on your own hardware, you can run it 24/7 without worrying about token prices or hitting subscription limits, and you can likely get more work done, even on much slower hardware. Customizing an open-source harness can also be far more efficient than something like Claude Code.
For any serious application, you might be more limited by your ability to review the code than by hardware speed.
DeepSeek V4 Pro is far more effective at batching multiple tasks together since its KV cache is so much lighter - a maximum of ~10GB at the full 1M context, scaling linearly with context length according to the DeepSeek V4 release paper. That's extremely impressive: it unlocks batching, agent swarms, etc. even on severely memory-constrained platforms, especially at smaller max context.
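Taking the ~10GB-at-1M-context figure at face value, the linear scaling makes KV budgeting trivial - roughly 10KB per token. A quick sketch (the memory figure comes from the claim above, not independent verification):

```python
# KV-cache budgeting under claimed linear scaling:
# ~10 GB at 1M context implies ~10 KB per token.
KV_BYTES_PER_TOKEN = 10e9 / 1e6  # 10 GB / 1M tokens = 10 KB/token

def concurrent_agents(free_mem_gb, max_context_tokens):
    """How many sequences fit in the memory left over after weights."""
    per_agent_gb = KV_BYTES_PER_TOKEN * max_context_tokens / 1e9
    return int(free_mem_gb // per_agent_gb)

print(concurrent_agents(80, 128_000))    # 80 GB free, 128k ctx each -> 62
print(concurrent_agents(80, 1_000_000))  # full 1M ctx each -> 8
```

Shrinking the max context per agent directly multiplies how many agents you can run, which is exactly the swarm-friendly property being described.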
Kimi is a natively quantized model; the lossless full-precision release is 595GB. Your own link mentions that.
So, realistically, $100K for an 8x RTX 6000 Pro system that can run it at a usable rate.
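A quick sanity check on that configuration, assuming 96GB per card (the published RTX PRO 6000 Blackwell capacity); the KV/activation allowance is a rough assumption:

```python
# Does a ~600 GB quantized model plus working memory fit on 8 cards?
NUM_GPUS, GB_PER_GPU = 8, 96  # RTX PRO 6000 Blackwell: 96 GB each
MODEL_GB = 600                # Q8-class quant size from the thread
KV_AND_OVERHEAD_GB = 80       # rough allowance for KV cache + activations

total_vram = NUM_GPUS * GB_PER_GPU                     # 768 GB
fits = MODEL_GB + KV_AND_OVERHEAD_GB <= total_vram
print(total_vram, fits)  # 768 True
```

768GB total leaves on the order of 160GB of headroom over the weights, which is tight but workable for serving a handful of concurrent requests.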
I think people will always disagree on what qualifies as a "usable rate". But keep in mind that practically no one sensible is running the latest Opus or GPT around the clock, especially not at sustainable, unsubsidized prices. With open-weight models it's easy to do that.
The 'unsloth' link above is a third party's Q8 quantization; the original release is considerably larger than 600GB:
https://huggingface.co/moonshotai/Kimi-K2.6
No.
I have downloaded Kimi-K2.6 (the original release).
For comparison (sorted by decreasing size: 3 larger models and 3 smaller, all recently launched):
That page mentions that the model is natively INT4 for most of the parameters, and 600GB is in the ballpark of what's available there for download.