
Comment by armcat

4 hours ago

Out of curiosity, what kind of specs do you have (GPU / RAM)? I saw the requirements and it's beyond my budget, so I'm "stuck" with smaller Qwen coders.

I'm not running it locally (it's gigantic!). I'm using the API at https://platform.moonshot.ai

  • Just curious - how does it compare to GLM 4.7? Ever since they gave the $28/year deal, I've been using it for personal projects and am very happy with it (via opencode).

    https://z.ai/subscribe

    • There's no comparison. GLM 4.7 is fine and reasonably competent at writing code, but K2.5 is right up there with something like Sonnet 4.5. It's the first time I can use an open-source model and not immediately tell the difference between it and top-end models from Anthropic and OpenAI.

    • It's waaay better than GLM 4.7 (which was the open model I was using earlier)! Kimi was able to quickly and smoothly finish some very complex tasks that GLM completely choked on.

    • From what people say, it's better than GLM 4.7 (and I guess DeepSeek 3.2)

      But it's also like... 10x the price per output token on any of the providers I've looked at.

      I don't feel it's 10x the value. It's still much cheaper than paying by the token for Sonnet or Opus, but if you have a subscription plan from one of the Big 3 (OpenAI, Anthropic, Google), that's much better value for the $$.

      It comes down to ethical or openness reasons to use it, I guess.

  • How long until this can be run on consumer-grade hardware, or on a domestic electricity supply, I wonder.

    Anyone have a projection?

    • You can run it on consumer-grade hardware right now, but it will be rather slow. NVMe SSDs these days have a read speed of 7 GB/s (EDIT: or even faster than that! Thank you @hedgehog for the update), so you'd get roughly one token every two to three seconds while crunching through the 32 billion active parameters, which are natively quantized to 4 bits each (rough math below). If you want to run it faster, you have to spend more money.

      Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/

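      A rough back-of-the-envelope sketch of that estimate (assuming the weights stream from a single NVMe drive and the SSD is the only bottleneck; compute and KV-cache reads are ignored):

      ```python
      # Tokens per second when streaming MoE weights from NVMe.
      # Assumptions: one full read of the active parameters per generated
      # token, and SSD bandwidth is the only bottleneck (no compute cost).

      active_params = 32e9    # active parameters per token (MoE)
      bits_per_param = 4      # native 4-bit quantization
      nvme_read_gb_s = 7      # GB/s sequential read of a fast NVMe drive

      bytes_per_token = active_params * bits_per_param / 8   # ~16 GB
      seconds_per_token = bytes_per_token / (nvme_read_gb_s * 1e9)

      print(f"~{bytes_per_token / 1e9:.0f} GB read per token")
      print(f"~{seconds_per_token:.1f} s/token (~{1 / seconds_per_token:.2f} tok/s)")
      ```

      That works out to roughly 2.3 s per token; real-world overhead (random reads, compute, KV cache) pushes it toward the upper end of the two-to-three-second range above.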

    • You can run it on a Mac Studio with 512 GB of RAM; that's the easiest way. I run it at home on a multi-GPU rig with partial offload to RAM.

    • You need around 600 GB of VRAM + RAM (+ disk) to fit the full model, or around 240 GB for the 1-bit quantized version (rough footprint math below). Of course this will be slow.

      Through the Moonshot API it's pretty fast (much, much faster than Gemini 3 Pro and Claude Sonnet, probably faster than Gemini Flash), though. To get a similar experience locally, they say you need at least 4× H200s.

      If you don't mind running it super slow, you still need around 600 GB of combined VRAM + fast RAM.

      It's already possible to run 4× H200s in a domestic environment (it would feel instantaneous for most tasks, unbelievable speed). It's just very, very expensive and probably challenging for most users, though manageable/easy for the average Hacker News crowd.

      It's expensive AND high-end GPUs are hard to source. If you manage to source them at the old prices, it's around $200k to get maximum speed, I guess; you could probably run it decently, though slowly, on a bunch of high-end machines for, let's say, $40k.
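
      Rough footprint math behind those 600 GB / 240 GB figures (a sketch; the total parameter count is an assumption back-derived from ~600 GB at 4 bits per weight, not taken from a model card, and KV cache / mixed-precision layers are ignored):

      ```python
      # Naive weight-only footprint at different average bit-widths.
      # total_params is an assumption inferred from ~600 GB at 4 bits/param.

      total_params = 1.2e12   # assumed total parameters (MoE)

      def footprint_gb(avg_bits_per_param: float) -> float:
          """Weight size in GB for a given average bits per parameter."""
          return total_params * avg_bits_per_param / 8 / 1e9

      for bits in (4, 2, 1.6):
          print(f"{bits:>3} bits/param -> ~{footprint_gb(bits):.0f} GB")
      ```

      Under these assumptions, the ~240 GB "1-bit" build works out to roughly 1.6 bits per parameter on average.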

Just pick up any >240 GB VRAM GPU from your local Best Buy to run a quantized version.

> The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.

  • You could run the full, unquantized model at high speed with 8 RTX 6000 Blackwell boards (quick aggregate-VRAM check below).

    I don't see a way to put together a decent system of that scale for less than $100K, given RAM and SSD prices. A system with 4x H200s would cost more like $200K.
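
    A quick sanity check of whether the quoted 630 GB fits in aggregate VRAM (a sketch; the per-card memory sizes are assumptions for the cards named in this thread, and KV cache / parallelism overhead is ignored):

    ```python
    # Does the ~630 GB checkpoint fit in aggregate VRAM?
    # Per-card sizes are assumptions; real deployments also need headroom
    # for KV cache and activations.

    model_gb = 630  # quoted full Kimi K2.5 size

    configs = {
        "8x RTX 6000 Blackwell (96 GB each)": 8 * 96,
        "4x H200 (141 GB each)": 4 * 141,
    }

    for name, total_vram in configs.items():
        margin = total_vram - model_gb
        verdict = f"{margin} GB to spare" if margin >= 0 else f"{-margin} GB short"
        print(f"{name}: {total_vram} GB total -> {verdict}")
    ```

    With those numbers, the 8-board configuration clears the checkpoint with room left for KV cache, which lines up with the claim above.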