Huggingface Link: https://huggingface.co/moonshotai/Kimi-K2.5
1T parameters, 32B active parameters.
License: MIT with the following modification:
Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.
One. Trillion. Even on native int4 that’s… half a terabyte of vram?!
Technical awe aside at this marvel that cracks the 50th percentile of HLE, the snarky part of me says there’s only half the danger in giving away something nobody can run at home anyway…
The model absolutely can be run at home. There even is a big community around running large models locally: https://www.reddit.com/r/LocalLLaMA/
The cheapest way is to stream it from a fast SSD, but it will be quite slow (one token every few seconds).
The next step up is an old server with lots of RAM and many memory channels, maybe with a GPU thrown in for faster prompt processing (low double-digit tokens/second).
At the high end, there are servers with multiple GPUs with lots of VRAM or multiple chained Macs or Strix Halo mini PCs.
The key enabler here is that the models are MoE (Mixture of Experts), which means that only a small(ish) part of the model is required to compute the next token. In this case, there are 32B active parameters, which is about 16GB at 4 bit per parameter. This only leaves the question of how to get those 16GB to the processor as fast as possible.
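A rough back-of-the-envelope version of that last point (the bandwidth figures below are illustrative assumptions, not measurements):

    # Why memory bandwidth dominates local MoE inference.
    # 32B active params at 4 bits is from the model card; bandwidths are assumed.
    active_params = 32e9
    bytes_per_token = active_params * 0.5      # ~16 GB of weights touched per token

    setups_gbps = {
        "fast NVMe SSD (streaming weights)": 7,
        "dual-channel DDR5 desktop": 90,
        "12-channel DDR5 server": 460,
        "multi-GPU VRAM (aggregate)": 3000,
    }
    for name, gbps in setups_gbps.items():
        # Upper bound only: ignores compute, routing overhead and KV-cache traffic.
        print(f"{name}: ~{gbps * 1e9 / bytes_per_token:.1f} tokens/s")

That lines up with the comment: sub-1 token/s streaming from SSD, single digits on a desktop, low double digits on a many-channel server, and much more once everything sits in VRAM.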
Which conveniently fits on one 8xH100 machine. With 100-200 GB left over for overhead, kv-cache, etc.
VRAM is the new moat, and controlling pricing and access to VRAM is part of it. There will be very few hobbyists who can run models of this size. I appreciate the spirit of making the weights open, but realistically, it is impractical for >99.999% of users to run locally.
I run Kimi K2 at home, most of it in system RAM with a few layers offloaded to old 3090s. This is a cheap budget build.
Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf: generated 5,231 tokens in 604.63 s (8.65 tokens/s).
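For anyone curious what that kind of partial-offload setup looks like, here is a minimal sketch using the llama-cpp-python bindings; the layer count, context size and thread count are illustrative guesses, not the commenter's actual settings:

    from llama_cpp import Llama

    # Point at the first shard of the split GGUF; the remaining shards are picked up automatically.
    llm = Llama(
        model_path="Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf",
        n_gpu_layers=8,    # offload a handful of layers to the 3090s, the rest stays in system RAM
        n_ctx=8192,        # context window; larger values need more memory for the KV cache
        n_threads=32,      # CPU threads for the layers kept in RAM
    )

    out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])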
That's what intelligence takes. Most of intelligence is just compute.
$3,998.99 for 500 GB of RAM on Amazon.
"Good Luck" - Kimi <Taken voice>
Cursor devs, who go out of their way to not mention their Composer model is based on GLM, are not going to like that.
Source? I've heard this rumour twice but never seen proof. I assume it would be based on tokeniser quirks?
Hey, have they open-sourced all of Kimi K2.5 (thinking, instruct, agent, agent swarm [beta])?
Because I feel like they mentioned that agent swarm is available through their API, and that made me feel as if it wasn't open (weights). Please let me know whether all of them are open source or not.
I'm assuming the swarm part is all harness. Well, I mean a harness and a way of thinking that the weights have just been fine-tuned to use.
> or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.
Why not just say "you shall pay us 1 million dollars"?
? They prefer the branding. The license just says you have to say it was them if the product makes more than $20mm a month in revenue (about $240mm a year) or has over 100M monthly users.
Companies with $20M in monthly revenue will not normally have a spare $1M available. They'd get more money by charging reasonable subscriptions than by using lawyers to chase sudden company-ending fees.
I assume this allows them to sue for different amounts. And not discourage too many people from using it.
The "Deepseek moment" is just one year ago today!
Coincidence or not, let's just marvel for a second at this amount of magic/technology that's being given away for free... and how liberating and different this is from OpenAI and others that were closed to "protect us all".
There have been so many moments that folks not really heavy into LLMs have missed. DeepSeek R1 was great, but so were all the "incremental" improvements: v3-0324, v3.1, v3.1-terminus, and now v3.2-speciale. On top of that, this is the 3rd great Kimi model. Then GLM has been awesome since 4.5, with 4.5, 4.5-air, 4.6, 4.7 and now 4.7 flash. Minimax-M2 has also been making waves lately... and I'm just talking about the Chinese models, without even adding the 10+ Qwen models. Outside of Chinese models, mistral-small/devstral, gemma-27b-it, gpt-oss-120b and seed-oss have been great, and I'm still talking about just LLMs, not image, audio, or special-domain models like deepseek-prover and deepseek-math. It's really a marvel what we have at home. I cancelled my OpenAI and Anthropic subscriptions 2 years ago once they started calling for regulation of open models, and I haven't missed them one bit.
What's your hardware/software setup?
It’s not a coincidence. Chinese companies tend to do big releases before Chinese New Year, so expect more to come before Feb 17.
What amazes me is why would someone spend millions to train this model and give it away for free. What is the business here?
The Chinese state maybe sees open collaboration as the way to nullify any US lead in the field; concurrently, if the next "search winner" is built upon their model, it carries along the Chinese worldview that Taiwan belongs to China and the Tiananmen Square massacre never happened.
Also, their license says that if you have a big product you need to promote them. Remember how Google "gave away" site search widgets, and how that was perhaps one of the major ways they gained recognition as the search leader.
OpenAI/Nvidia are the Pets.com/Sun of our generation: insane valuations, stupid spend, expensive options, expensive hardware and so on.
Sun hardware bought for 50k USD to run websites in 2000 is less capable than perhaps a 5-dollar/month VPS today?
"Scaling to AGI/ASI" was always a fool's errand; best case, OpenAI should've squirreled away money to build a solid engineering department that could focus on algorithmic innovations, but considering that Anthropic, Google and Chinese firms have caught up or surpassed them, it seems they didn't.
Once things blow up, the closed options that had somewhat sane/solid model research that handles things better will be left, along with a ton of new competitors running modern/cheaper hardware and just using models as building blocks.
Speculating: there are two connected businesses here, creating the models, and serving the models. Outside of a few moneyed outliers, no one is going to run this at home. So at worst opening this model allows mid-sized competitors to serve it to customers from their own infra -- which helps Kimi gain mindshare, particularly against the large incumbents who are definitely not going to be serving Kimi and so don't benefit from its openness.
Given the shallowness of moats in the LLM market, optimizing for mindshare would not be the worst move.
Moonshot’s (Kimi’s owner) investors are Alibaba/Tencent et al. Chinese market is stupidly competitive, and there’s a general attitude of “household name will take it all”. However getting there requires having a WeChat-esque user base, through one way or another. If it’s paid, there’ll be friction and it won’t work. Plus, it undermines a lot of other companies, which is a win for a lot of people.
I think this fits into some "Commoditize The Complement" strategy.
https://gwern.net/complement
I think there is a book (Chip War) about how the USSR did not effectively participate in staying at the edge of the semiconductor revolution. And they have suffered for it.
China has decided they are going to participate in the LLM/AGI/etc revolution at any cost. So it is a sunk cost, and the models are just an end product; any revenue is validation and great, but not essential. The cheaper price points keep their models used and relevant. It challenges the other (US, EU) models to innovate and keep ahead to justify their higher valuations (both monthly plan and investor). Once those advances are made, they can be brought back into their own models. In effect, the currently leading models are running from a second-place candidate who never gets tired and eventually does what they do at a lower price point.
All economically transformative technologies have done something similar. If it's privatized, it's not gonna be transformative across the industry. GPS, the internet, touchscreens, AI voice assistants, microchips, LCDs, etc. were all publicly funded (or made by Bell Labs, which had a state-mandated monopoly that forced them to open up their patents).
The economist Mariana Mazzucato wrote a great book about this called The Entrepreneurial State: Debunking Public vs. Private Sector Myths
> What amazes me is why would someone spend millions to train this model and give it away for free. What is the business here?
How many millions did Google spend on Android (acquisition and salaries), only to give it away for free?
Usually, companies do this to break into a monopolized market (or one that's at risk of becoming one), with openness as a sweetener. IBM with Linux to break UNIX-on-big-iron domination, Google with Android vs. iPhone, Sun with OpenSolaris vs. Linux-on-x86.
Hosting the model gets cheaper per token the more batched tokens you serve, so they have a big advantage here.
Curious to hear what “OpenAI” thinks the answer to this is
It's another state project funded at the discretion of the party.
If you look at past state projects, profitability wasn't really considered much. They are notorious for a "money hose until a diamond is found in the mountains of waste" approach.
I am convinced that was mostly just marketing. No one uses DeepSeek as far as I can tell. People are not running it locally. People choose GPT/Gemini/Claude/Grok if you are giving your data away anyway.
The biggest source of my conspiracy theory is that I made a Reddit thread asking a question, "Why all the deepseek hype" or something like that. And to this day, I get odd 'pro-DeepSeek' comments from accounts only used every few months. It's not like this was some highly upvoted topic sitting in the 'Top'.
I'd put that DeepSeek marketing on par with an Apple marketing campaign.
I don't use DeepSeek, but I prefer Kimi and GLM to closed models for most of my work.
Except that, on OpenRouter, DeepSeek always stays in the top 10 rankings. Although I don't use it personally, I believe its main advantage over other models is price/performance.
I mean, there are credible safety issues here. A Kimi fine-tune will absolutely be able to help people do cybersecurity related attacks - very good ones.
In a few years, or less, biological attacks and other sorts of attacks will be plausible with the help of these agents.
Chinese companies aren't humanitarian endeavors.
> For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls.
> K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduction in end-to-end runtime
Not just RL on tool calling, but RL on agent orchestration, neat!
1,500 tool calls per task sounds like a nightmare for unit economics though. I've been optimizing my own agent workflows and even a few dozen steps makes it hard to keep margins positive, so I'm not sure how this is viable for anyone not burning VC cash.
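As a very rough sanity check on those unit economics (every number below is an assumption for illustration, not a figure from the announcement):

    # Back-of-the-envelope cost of a 1,500-tool-call task.
    tool_calls = 1500
    input_tokens_per_call = 2000    # assumed context sent per call; much of it is cacheable
    output_tokens_per_call = 300    # assumed tokens generated per call
    price_in, price_out = 0.60 / 1e6, 3.00 / 1e6   # assumed $/token, roughly open-weight provider rates

    cost = tool_calls * (input_tokens_per_call * price_in + output_tokens_per_call * price_out)
    print(f"~${cost:.2f} per task before prompt-caching discounts")   # ~$3.15

Whether that pencils out obviously depends on caching, batching discounts, and what you can charge per task.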
"tool call" is just a reference to any elementary interaction with the outside system. It's not calling third-party APIs or anything like that.
> Kimi K2.5 can self-direct an agent swarm
Is this within the model? Or within the IDE/service that runs the model?
Because tool calling is mostly just the agent outputting "call tool X", and the IDE does it and returns the data back to the AI's context.
An LLM only outputs tokens, so this could be seen as an extension of tool calling, where the model has been trained on the knowledge and use cases for "tool-calling" itself as a sub-agent.
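To make that concrete, here is a minimal sketch of the loop a harness typically runs; call_model and run_tool are placeholders, not any specific vendor API:

    def agent_loop(task, call_model, run_tool, max_steps=50):
        # Minimal harness: the model only emits text or tool requests;
        # the harness executes the tool and feeds the result back into context.
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_model(messages)              # one LLM completion
            if reply.get("tool_call") is None:        # plain answer: done
                return reply["content"]
            name, args = reply["tool_call"]           # e.g. ("read_file", {"path": "main.py"})
            result = run_tool(name, args)             # the harness, not the model, does the work
            messages.append({"role": "assistant", "content": reply["content"]})
            messages.append({"role": "tool", "name": name, "content": result})
        return "step limit reached"

Spawning a sub-agent is then just another tool whose implementation happens to call agent_loop again with a narrower task.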
Parallel agents are such a simple, yet powerful hack. Using it in Claude Code with TeammateTool and getting lots of good results!
> TeammateTool
What is this?
I posted this elsewhere but thought I'd repost here:
* https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using ELO
* https://dashboard.safe.ai/ — CAIS' incredible dashboard
* https://clocks.brianmoore.com/ — a visual comparison of how well models can draw a clock. A new clock is drawn every minute
* https://eqbench.com/ — emotional intelligence benchmarks for LLMs
* https://www.ocrarena.ai/battle — OCR battles, ELO
* https://mafia-arena.com/ — LLMs playing the social deduction game Mafia
* https://openrouter.ai/rankings — marketshare based on OpenRouter
One thing that caught my eye is that besides the K2.5 model, Moonshot AI also launched Kimi Code (https://www.kimi.com/code), which evolved from Kimi CLI. It is a terminal coding agent; I've been using it over the last month with a Kimi subscription, and it is a capable agent with a stable harness.
GitHub: https://github.com/MoonshotAI/kimi-cli
>Kimi Code CLI is not only a coding agent, but also a shell.
That's cool. It also has a zsh hook, allowing you to switch to agent mode wherever you are.
It is. Kimi Code CLI supports Zed's Agent Client Protocol (http://agentclientprotocol.com/), so it can act as an external agent that runs in any ACP-compatible client, e.g. Zed, JetBrains, Toad CLI, Minano Notebook. It also supports Agent Skills. The Moonshot AI developers actively update the agent. I really like their CLI.
Does it support the swarm feature? Does Opencode?
https://github.com/code-yeongyu/oh-my-opencode
How does it fare against CC?
Anecdotally, I've cancelled my Claude Code subscription after using Kimi K2.5 and Kimi CLI for the last few days. It's handled everything I've thrown at it. It is slower at the moment, but I expect that will improve.
Have you all noticed that the latest releases (Qwen3 Max Thinking, now Kimi K2.5) from Chinese companies are benching against Claude Opus now and not Sonnet? They are truly catching up, almost at the same pace.
https://clocks.brianmoore.com
K2 is one of the only models to nail the clock face test as well. It’s a great model.
Kimi K2 is remarkably consistently the best. I wonder if it's somehow been trained specifically on tasks like these. It seems too consistent to be a coincidence.
Also shocking is how the most common runner-up I've seen is DeepSeek.
It's better than most, but not 100%. As I see this the clock hands are all correct, but the numbers only go 1-8.
Cool comparison, but none of them get both the face and the time correct when I look at it.
They distill the major western models, so anytime a new SOTA model drops, you can expect the Chinese labs to update their models within a few months.
This is just a conspiracy theory/urban legend. How do you "distill" a proprietary model with no access to the original weights? Just doing the equivalent of training on chat/API logs has terrible effectiveness (you're trying to drink from a giant firehose through a tiny straw) and gives you no underlying improvements.
Yes, they do distill. But saying all they do is distill is not correct and actually kind of unfair. These Chinese labs have done lots of research in this field and publish it openly, and some if not most of them contribute open-weight models, making a future of local LLMs possible: DeepSeek, Moonshot, Minimax, Z.ai, Alibaba (Qwen).
They are not just leeching here; they took this innovation, refined it and improved it further. This is what the Chinese are good at.
Source?
They are, in benchmarks. In practice Anthropic's models are ahead of where their benchmarks suggest.
Bear in mind that lead may be, in large part, from the tooling rather than the model
The benching is sus, it's way more important to look at real usage scenarios.
I've read several people say that Kimi K2 has a better "emotional intelligence" than other models. I'll be interested to see whether K2.5 continues or even improves on that.
I love the Kimi response style. It's much more concise, without all the unnecessary "great question!"s and other annoying AI stuff
Yup, I experience the same. I don't know what they do to achieve this but it gives them this edge, really curious to learn more about what makes it so good at it.
A lot of people point to the Muon optimizer that Moonshot (the creators of Kimi) pioneered. Compared to the standard optimizer AdamW, Muon amplifies low-magnitude gradient directions which makes the model learn faster (and maybe gives Kimi its unique qualities).
Muon paper: https://arxiv.org/abs/2502.16982
Yes, though this is highly subjective - it 'feels' like that to me as well (compared to Gemini 3, GPT 5.2, Opus 4.5).
I'll test it out on mafia-arena.com once it is available on Open Router
The directionally interesting part is that according to the announcement, K2.5 seems to be trained specifically to create sub-agents and work in an agent swarm usefully. The key part is that you don't need to manually create or prompt sub-agents, K2.5 creates them automatically, so from the looks of things it's similar to Claude Code dynamic sub-agents except the model is trained to scale to many more agents autonomously.
I wonder whether Claude is doing the same kind of training and it's coming with the next model, and that's why the agent swarm mode in Claude Code is hidden for now. We might be getting very very good agent orchestrators/swarms very soon.
Curious what would be the most minimal reasonable hardware one would need to deploy this locally?
I parsed "reasonable" as in having reasonable speed to actually use this as intended (in agentic setups). In that case, it's a minimum of 70-100k for hardware (8x 6000 PRO + all the other pieces to make it work). The model comes with native INT4 quant, so ~600GB for the weights alone. An 8x 96GB setup would give you ~160GB for kv caching.
You can of course "run" this on cheaper hardware, but the speeds will not be suitable for actual use (i.e. minutes for a simple prompt, tens of minutes for high context sessions per turn).
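A quick sketch of where those numbers come from (the 8x 96 GB figure is the setup mentioned above; the rest is simple arithmetic):

    # VRAM budget for a native-INT4 local deployment.
    weights_gb = 600                      # ~1T params at 4 bits, plus embeddings/scales/buffers
    cards, vram_per_card = 8, 96          # e.g. 8x 96 GB workstation GPUs
    total_vram = cards * vram_per_card    # 768 GB
    kv_budget = total_vram - weights_gb   # ~168 GB left for KV cache across sessions
    print(total_vram, kv_budget)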
Models of this size can usually be run using MLX on a pair of 512GB Mac Studio M3 Ultras, which are about $10,000 each so $20,000 for the pair.
You might want to clarify that this is more of a "Look it technically works"
Not a "I actually use this"
The difference between waiting 20 minutes to answer the prompt '1+1='
and actually using it for something useful is massive here. I wonder where this idea of running AI on CPU comes from. Was it Apple astroturfing? Was it Apple fanboys? I don't see people wasting time on non-Apple CPUs. (Although, I did do this for a 7B model)
I think you can put a bunch of Apple Silicon Macs with enough RAM together,
e.g. in an office or coworking space.
800-1000 GB of RAM perhaps?
K2 0905 and, shortly after it, K2 Thinking have done impressively well in my personal use cases and were severely slept on. Faster, more accurate, less expensive, more flexible in terms of hosting, and available months before Gemini 3 Flash; I really struggle to understand why Flash got such positive attention at launch.
Interested in the dedicated Agent and Agent Swarm releases, especially in how that could affect third party hosting of the models.
K2 thinking didn't have vision which was a big drawback for my projects.
Congratulations, great work Kimi team.
Why is it that Claude is still at the top in coding? Are they heavily focused on training for coding, or is their general training so good that it performs well in coding?
Someone please beat Opus 4.5 in coding; I want to replace it.
I don't think that kind of difference in benchmarks has any meaning at all. Your agentic coding tool and the task you are working on introduce a lot more "noise" than that small delta.
Also consider that they are all overfitting on the benchmark itself, so there might be that as well (which can go in either direction).
I consider the top models practically identical for coding applications (just personal experience with heavy use of both GPT5.2 and Opus 4.5).
Excited to see how this model compares in real applications. It's 1/5th of the price of top models!!
I replaced Opus with Gemini Pro and it's just plain a better coder IMO. It'll restructure code to enable support for new requirements where Opus seems to just pile on more indirection layers by default, when it doesn't outright hardcode special cases inside existing functions, or drop the cases it's failing to support from the requirements while smugly informing you you don't need that anyway.
Opus 4.5 only came out two months ago, and yes Anthropic spends a lot of effort making it particularly good at coding.
Gemini 3 pro is way better than Opus especially for large codebases.
Do you use it only for code editing, or also for running bash commands? My experience is that it is very bad at the latter.
My experience is the total opposite.
Kimi was already one of the best writing models. Excited to try this one out
To me, Kimi has been the best at writing and conversing; it's way more human-like!
Pretty cute pelican https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...
Oops, here's a working link: https://gist.github.com/simonw/32a85e337fbc6ee935d10d89726c0...
doesn't work, looks like the link or SVG was cropped.
No pelican for me :(
About 600GB needed for the weights alone, so on AWS you need a p5.48xlarge (8× H100), which costs $55/hour.
A realistic setup for this would be a 16× H100 80GB with NVLink. That comfortably handles the active 32B experts plus KV cache without extreme quantization. Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.
The weights are int4, so you'd only need 8xH100
You don't need to wait and see, Kimi K2 has the same hardware requirements and has several providers on OpenRouter:
https://openrouter.ai/moonshotai/kimi-k2-thinking https://openrouter.ai/moonshotai/kimi-k2-0905 https://openrouter.ai/moonshotai/kimi-k2-0905:exacto https://openrouter.ai/moonshotai/kimi-k2
Generally it seems to be in the neighborhood of $0.50/1M for input and $2.50/1M for output
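At those rates a single agentic turn is cheap; a quick illustration (the token counts are made up):

    # Cost of one request at ~$0.50/M input and ~$2.50/M output tokens.
    input_tokens, output_tokens = 20_000, 2_000
    cost = input_tokens / 1e6 * 0.50 + output_tokens / 1e6 * 2.50
    print(f"~${cost:.3f} per request")    # ~$0.015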
Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.
You can do it, and it may be OK for a single user with idle waiting times, but performance/throughput will be roughly halved (closer to 2/3) and free context will be more limited with 8x H200 vs 16x H100 (assuming a decent interconnect). Depending a bit on use case and workload, 16x H100 (or 16x B200) may be a better config for cost optimization. Often there is a huge economy of scale with such large mixture-of-experts models, so it can even be cheaper to use 96 GPUs instead of just 8 or 16. The reasons are complicated and involve better prefill caching and less memory transfer per node.
The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use, with two Mac Studios connected over Thunderbolt 5 RDMA.
That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.
I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.
That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.
As your local vision nut, their claims about "SOTA" vision are absolutely BS in my tests.
Sure, it's SOTA at standard vision benchmarks. But on tasks that require proper image understanding (see for example BabyVision [0]), it appears very much lacking compared to Gemini 3 Pro.
[0] https://arxiv.org/html/2601.06521v1
Gemini remains the only usable vision fm :(
https://archive.is/P98JR
I don't get this "agent swarm" concept. You set up a task and they boot up 100 LLMs to try to do it in parallel, and then one "LLM judge" puts it all together? Is there anywhere I can read more about it?
You can read about this basically everywhere - the term of art is agent orchestration. Gas town, Claude’s secret swarm mode, or people who like to use phrases like “Wiggum loop” will get you there.
If you’re really lazy - the quick summary is that you can benefit from the sweet spot of context length and reduce instruction overload while getting some parallelism benefits from farming tasks out to LLMs with different instructions. The way this is generally implemented today is through tool calling, although Claude also has a skills interface it has been trained against.
So the idea would be for software development, why not have a project/product manager spin out tasks to a bunch of agents that are primed to be good at different things? E.g. an architect, a designer, and so on. Then you just need something that can rectify GitHub PRs and bob’s your uncle.
Gas town takes a different approach and parallelizes on coding tasks of any sort at the base layer, and uses the orchestration infrastructure to keep those coders working constantly, optimizing for minimal human input.
I'm not sure whether there are parts of this done for Claude, but those other ones are layers on top of the usual LLMs we see. This seems to be a bit different, in that there's a different model trained specifically for splitting up and managing the workload.
I've also been quite skeptical, and I became even more skeptical after hearing a tech talk from a startup in this space [1].
I think the best way to think about it is that it's an engineering hack to deal with a shortcoming of LLMs: for complex queries, LLMs are unable to directly compute a SOLUTION given a PROMPT, but are instead able to break the prompt down into intermediate solutions and eventually solve the original prompt. These "orchestrator" / "swarm" agents add some formalism to this, allow you to distribute compute, and then also use specialized models for some of the sub-problems.
[1] https://www.deepflow.com/
You have a team lead that establishes a list of tasks that are needed to achieve your mission
then it creates a list of employees, each of them is specialized for a task, and they work in parallel.
Essentially hiring a team of people who get specialized on one problem.
Do one thing and do it well.
But in the end, isn't this the same idea as MoE?
Where we have more specialized "jobs", which the model is actually trained for.
I think the main difference with agent swarms is the ability to run them in parallel. I don't see how this adds much compared to simply sending multiple API calls in parallel with your desired tasks. I guess the only difference is that you let the AI decide how to split those requests and what each task should be.
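That "just send multiple API calls in parallel" baseline is easy to sketch; the claimed difference with a swarm-trained model is that it decides the split itself. A minimal version, assuming hypothetical async plan_split and call_model helpers:

    import asyncio

    async def run_swarm(task, plan_split, call_model):
        # Naive swarm: one call plans the subtasks, workers run in parallel,
        # and a final call merges the results. plan_split/call_model are placeholders.
        subtasks = await plan_split(task)     # e.g. ["design schema", "write tests", ...]
        results = await asyncio.gather(*(call_model(s) for s in subtasks))
        return await call_model(
            "Combine these partial results into one answer:\n" + "\n---\n".join(results)
        )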
The datacenters yearn for the chips.
Running on Apple Silicon: https://x.com/awnihannun/status/2016221496084205965
CCP-bench has gotten WAY better on K2.5!
https://big-agi.com/static/kimi-k2.5-less-censored.jpg
Can we please stop calling those models "open source"? Yes, the weights are open. So "open weight", maybe. But the source isn't open, i.e. the thing that allows you to re-create it. That's what "open source" used to mean. (Together with a license that allows you to use that source for various things.)
No major AI lab will admit to training on proprietary or copyrighted data, so what you are asking is an impossibility. You can make a pretty good LLM if you train on Anna's Archive, but it will either be released anonymously or with a research-only, non-commercial license.
There isn't enough public domain data to create good LLMs, especially once you get into the newer benchmarks that expect PhD-level domain expertise in various niche verticals.
It's also a logical impossibility to create a zero-knowledge proof that would allow you to attribute the model to specific training data without admitting to usage.
I can think of a few technical options but none would hold water legally.
You can use a Σ-protocol OR-composition to prove that it was trained either on a copyrighted dataset or a non copyrighted dataset without admitting to which one (technically interesting, legally unsound).
You can prove that a model trained on copyrighted data is statistically indistinguishable from one trained on non-copyrighted data (an information-theoretic impossibility unless there exists as much public domain data as copyrighted data, in similar distributions).
You can prove that a public domain dataset and a copyrighted dataset are equivalent if the models they produce are indistinguishable in performance.
All the proofs fail irl, ignoring the legal implications, because there's less public domain information, so given the lemma that more training data == improved model performance, all the above are close to impossible.
Those are some impressive benchmark results. I wonder how well it does in real life.
Maybe we can get away with something cheaper than Claude for coding.
I'm curious about the "cheaper" claim -- I checked Kimi pricing, and it's a $200/mo subscription too?
On OpenRouter, K2.5 is at $0.60/$3 per Mtok. That's Haiku pricing.
They also have a $20 and $40 tier.
Is there a startup that takes models like this and effectively gives you a secure setup, where you have (a) a mobile app that (b) talks to some giant machine that only you have access to?
If a $10K computer could run this, it may be worth it to have a "fully on-prem" version of ChatGPT running for you.
I've had these weird situations where some models refuse to use SSH as a tool. Not sure if it was a limitation of the coding tool or whether it is baked into some of the models.
Is this actually good, or just optimized heavily for benchmarks? I am hopeful it's the former based on the writeup, but I need to put it through its paces.
Quite good in my testing
Glad to see open source models catching up and treating vision as a first-class citizen (a.k.a. a native multimodal agentic model). GLM and Qwen models take a different approach, having a base model and a vision variant (glm-4.6 vs glm-4.6v).
I guess after Kimi K2.5, other vendors will go the same route?
Can't wait to see how this model performs on computer automation use cases like VITA AI Coworker.
https://www.vita-ai.net/
There are so many models; is there any website with a list of all of them and a comparison of their performance on different tasks?
The post actually has great benchmark tables inside of it. They might be outdated in a few months, but for now, it gives you a great summary. Seems like Gemini wins on image and video perf, Claude is the best at coding, ChatGPT is the best for general knowledge.
But ultimately, you need to try them yourself on the tasks you care about and just see. My personal experience is that right now, Gemini Pro performs the best at everything I throw at it. I think it's superior to Claude and all of the OSS models by a small margin, even for things like coding.
I like Gemini Pro's UI over Claude's so much, but honestly I might start using Kimi K2.5 if it's open source and roughly on par with Gemini Pro/ChatGPT/Claude, because at that point I feel like the differences are negligible and we are getting SOTA open source models again.
There is https://artificialanalysis.ai
There are many lists, but I find all of them outdated or containing wrong information or missing the actual benchmarks I'm looking for.
I was thinking that maybe it's better to make my own benchmarks with the questions/things I'm interested in, and whenever a new model comes out, run those tests against it using OpenRouter.
Thank you! Exactly what I was looking for
they cooked
Actually open source, or yet another public model, which is the equivalent of a binary?
URL is down so cannot tell.
It's open weights, not open source.
The label 'open source' has become a reputation-reaping and marketing vehicle rather than an informative term since the Hugging Face benchmark race started. With the weights only, we cannot actually audit whether a model is a) contaminated by benchmarks, b) built with deliberate biases, or c) trained on copyrighted/private data, let alone allow other vendors to replicate the results. Anyways, people still love free stuff.
Just accept that IP laws don't matter and the old "free software" paradigm is dead. Aaron Swartz died so that GenAI may live. RMS and his model of "copyleft" are so Web 1.0 (not even 2.0). No one in GenAI cares AT ALL about the true definition of open source. Good.
Cool
The chefs at Moonshot have cooked once again.