Dave Citron here, from the MAI team. Thanks for the feedback, we're getting the model card updated to call out 5B active parameters (137B total).
On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2% which we see as a pretty big leap. But going forward, we'll include additional models in our benchmarks, including models like Qwen 3.6 and Gemma 4.
Have you run it through DeepSWE? I understand that's probably a high ask for this class of model, but would be interesting to see regardless.
Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/
Qwen HAS to be a part of the discussion here, even though Microsoft is a US-based entity. Their 30b MoE models absolutely hit way above their weight when paired with the right harness program, and can be run on "Costco gaming computer" specs when configured correctly in llama.cpp.
Sorry Trump Administration, but while the US has been downloading more ram by throwing data centers at everything and burning up everyone's power and water, China has come out with what's effectively a prototype edge compute capable AI model - regardless of how they built it. And arguably I can tokenmaxx on it just fine at around 30-40 tokens/sec.
And also, ASICs are on the way. Imagine one of those with a heavy hitting model (MoE or otherwise, Qwen or otherwise) installed in a PCIe slot at 10k+ tokens/sec and 75 watts max (maximum wattage deliverable by the PCIe slot alone) for $300-400 USD each.
Sorry/not sorry to rip this whole thing to shreds. But I'm sick and tired of these inefficient LLMs being produced that seemingly can only be offered by subscription from a data center, when I'm running a full AI stack right now (model and all) on my computer at home on a 750 watt max power supply. Microsoft really needs to get with the picture here and compete more with Qwen instead of just the US/EU entities.
Qwen is definitely the model to beat as of mid-2026. While I didn't benchmark with SWE, as my use cases are OpenClaw [1], I found both Qwen 3.6 35B A3B and, more impressively, Qwen 3.5 122B A10B starting to be competitive with closed flash models. The NVFP4 quant of the latter is what I'm running now on DGX.
How does qwen compare to deepseek or kimi? I haven't spent much time with qwen but I find deepseek to be mostly comparable to opus for my pet projects. Kimi k2.6 did a lot of stupid stuff and talked to itself a lot "let me do X... Wait, X doesn't make sense because the user explicitly said Y"
Deepseek seems to seek first to understand before going off.
The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet" competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot. Maybe it was part of their deal with OpenAI? Not sure.
Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.
Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.
Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.
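For anyone who wants to check the size comparison, here is a quick arithmetic sketch using only the parameter counts quoted in this thread (137B total / 5B active for MAI-Code-1-Flash, 35B total / 3B active for Qwen3.6-35B-A3B). The helper name `percent_smaller` is made up for illustration. Note that the roughly 98% figure appears to match Qwen's active parameters against MAI's total parameters; active against active comes out to 40%.

```python
def percent_smaller(small: float, large: float) -> float:
    """Return how much smaller `small` is than `large`, as a percentage."""
    return (1 - small / large) * 100


# Parameter counts in billions, as quoted in this thread.
MAI_TOTAL, MAI_ACTIVE = 137, 5    # MAI-Code-1-Flash (137B-A5B)
QWEN_TOTAL, QWEN_ACTIVE = 35, 3   # Qwen3.6-35B-A3B

total_vs_total = percent_smaller(QWEN_TOTAL, MAI_TOTAL)       # total vs total
active_vs_total = percent_smaller(QWEN_ACTIVE, MAI_TOTAL)     # Qwen active vs MAI total
active_vs_active = percent_smaller(QWEN_ACTIVE, MAI_ACTIVE)   # active vs active

print(f"total vs total:   {total_vs_total:.1f}% smaller")
print(f"active vs total:  {active_vs_total:.1f}% smaller")
print(f"active vs active: {active_vs_active:.1f}% smaller")
```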
Why is Haiku the benchmark though, with code generation don't we primarily care about the quality of the code - not the speed or efficiency at which it's generated?
While I agree directionally, I'll caveat that "cost per token" != "cost per task". In the case of Qwen3.6 it tends to think 1.6x more than Haiku, so the cost of Haiku on the same tasks tends to only be about double. More detail from comparing their Artificial Analysis metrics:
Qwen3.6-35B-A3B vs Claude Haiku 4.5
reasoning mode · AA Intelligence Index v4.0
[Bubble chart: x = cost to run the index (USD, lower is better); y = AA Intelligence Index (higher is better); bubble area = output speed (tokens/sec). Qwen3.6-35B-A3B sits up and to the left of Claude Haiku 4.5 (cheaper, smarter, faster); the plotted values are in the table below.]
┌─────────────────────┬──────────┬──────────┬───────────┐
│ model │ AA index │ run cost │ out speed │
├─────────────────────┼──────────┼──────────┼───────────┤
│ Qwen3.6-35B-A3B ●│ 43.5 │ $280 │ 196 t/s │
│ Claude Haiku 4.5 ○│ 37.1 │ $620 │ 93 t/s │
└─────────────────────┴──────────┴──────────┴───────────┘
COST PER TOKEN ≠ COST PER TASK
output tokens per index run:
Haiku 4.5 87.3M (79.3M reasoning + 8.0M answer)
Qwen3.6 143.2M (131.7M reasoning + 11.5M answer)
→ Qwen emits 1.64× more output
── output speed (tokens / sec) ────────── raw rate · higher = faster
Qwen3.6 100% ~196 t/s
Haiku 4.5 ~47% ~93 t/s
→ Qwen ~2.1× faster per token
╎ 1.64× more tokens < 2.1× faster rate
▼
── solution speed (per finished answer) ── higher = faster
Qwen3.6 100%
Haiku 4.5 ~78%
→ Qwen ~1.3× FASTER to a solution
SCORECARD
intelligence cost / task speed to solution
Qwen3.6-35B-A3B 43.5 $280 ~1.3× faster
Claude Haiku 4.5 37.1 $620 (slower)
→ Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
the raw-speed edge (2.1×), so Qwen stays ahead per task.
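The ratios in the scorecard can be reproduced from the table and token counts above. This is a minimal sketch, assuming a single sequential stream where time to a solution is just output tokens divided by output speed (it ignores time to first token, batching, and parallel requests). The `IndexRun` class and its field names are illustrative, not anything from Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass
class IndexRun:
    name: str
    score: float            # AA Intelligence Index
    run_cost_usd: float     # cost to run the whole index
    output_tokens_m: float  # output tokens per index run, in millions
    tokens_per_sec: float   # raw output speed

    @property
    def relative_time(self) -> float:
        """Relative time to emit the run's output at the quoted rate (arbitrary units)."""
        return self.output_tokens_m / self.tokens_per_sec


# Figures copied from the comparison above.
qwen = IndexRun("Qwen3.6-35B-A3B", 43.5, 280, 143.2, 196)
haiku = IndexRun("Claude Haiku 4.5", 37.1, 620, 87.3, 93)

token_ratio = qwen.output_tokens_m / haiku.output_tokens_m  # Qwen emits this many times more output
rate_ratio = qwen.tokens_per_sec / haiku.tokens_per_sec     # Qwen's raw per-token speed edge
speedup = haiku.relative_time / qwen.relative_time          # Qwen's edge per finished answer
cost_ratio = haiku.run_cost_usd / qwen.run_cost_usd         # Haiku's cost relative to Qwen

print(f"tokens: {token_ratio:.2f}x  rate: {rate_ratio:.2f}x  "
      f"speed to solution: {speedup:.2f}x  cost: {cost_ratio:.2f}x")
```

The speed-to-solution figure is simply the rate ratio divided by the token ratio, which is why the extra reasoning tokens eat into, but do not erase, the raw speed advantage.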
It's a start and I welcome competition but I don't think I ever used small cloud models like Haiku 4.5. They are cute but for serious coding they tend to waste your expensive time.
And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.
GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs: https://www.reddit.com/r/GithubCopilot
I have since changed to DeepSeek Flash on high which is Sonnet+ level for almost free.
If I feel I still need smarter models I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.
I use larger models to organize work into a topologically sorted task graph and pin smaller models to the tasks depending on the complexity with a larger model evaluating the work and patching where necessary. This uses haiku quite often for routine work. I’m able to do multi hour highly complex work with superior results and a much lower bill as a result by doing this, with a parent orchestrator able to do a massive labor within a single context window by effectively organizing work and reviewing quality and integrating where needed. I don’t use haiku directly, but it’s often 30-40% of any major efforts token use. This further improves time to completion as well as cost - but I find haiku is better at following literal instructions and plans without “second guessing,” while opus class models second guess in their thinking constantly.
As such, haiku isn’t a waste of my time, it saves enormous amounts of time for me. But I spent a large amount of time building the orchestration system up front and iterating on it to get here. Interestingly, I found my experience as a director and later a distinguished engineer gave me the tools to build it and get it working well and reliably end to end - the dynamics of multi-agent workflows of varying capability are not a lot different from the dynamics of a 1000-engineer organization.
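To make the "topologically sorted task graph with models pinned by complexity" idea concrete, here is a minimal sketch using Python's standard-library `graphlib`. It is not the commenter's actual system: the task names, the complexity scores, and the tier thresholds in `pick_model` are all hypothetical, and the review/patching pass by the larger model is left out.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
deps = {
    "design_schema": set(),
    "write_migration": {"design_schema"},
    "update_api": {"design_schema"},
    "write_tests": {"write_migration", "update_api"},
}

# Hypothetical complexity scores, e.g. assigned by the parent orchestrator.
complexity = {
    "design_schema": 3,
    "write_migration": 1,
    "update_api": 2,
    "write_tests": 1,
}


def pick_model(level: int) -> str:
    """Pin a model tier to a task by complexity (thresholds are assumptions)."""
    if level >= 3:
        return "opus"
    if level == 2:
        return "sonnet"
    return "haiku"


def plan(deps: dict, complexity: dict) -> list:
    """Return (task, model) pairs in an order that respects dependencies."""
    order = TopologicalSorter(deps).static_order()
    return [(task, pick_model(complexity[task])) for task in order]


schedule = plan(deps, complexity)
for task, model in schedule:
    print(f"{task:16s} -> {model}")
```

In a real setup each pair would be dispatched to a sub-agent, with independent tasks (here `write_migration` and `update_api`) free to run in parallel.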
Got anything from your orchestrator you could share that’s usable by others? Sounds like how I’d like to work but is difficult to get going from scratch
I've been doing benchmarking of various models for finding hard security bugs, and my faith in Haiku (and Sonnet, even) has dropped precipitously in the process. Self-hosted Qwen 3.6 27B consistently outperforms both for finding security bugs, which was a shocking result. I expected Qwen to be around Haiku level, maybe a little worse, and I definitely expected it to be worse than Sonnet.
And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.
There's seemingly no reason to ever use Haiku or Sonnet, if you're not getting it for free or as part of a subscription (that you don't usually saturate).
I don't think that's what these small models are for. They are for things like text summarization and generating a title for your AI session. Maybe Haiku occupies a weird zone where it's overpowered for those tasks but underpowered for anything more sophisticated. But for example I used it on an agentic reasoning task recently (reading a chunk of information and drawing a written conclusion, not writing code) and it did just fine. A more powerful model would have been a waste of money.
I don't suppose you've had a chance to benchmark MiniMax V3 yet? I've only just started testing other models after being an Anthropic fan. I haven't put MiniMax V3 to coding tasks yet, but something about my early simple tests has impressed me. The MiniMax API pricing is about 7% of Anthropic API prices (about matching Anthropic's subscription pricing).
Same opinion. Opus is best for coding, but Qwen 3.6 27b Q8 is next, before Sonnet.
Sonnet might have more knowledge and is maybe good for making excel sheets, but it does not write good code and does not follow instructions well.
But 27b Q8 needs a very beefy PC (48GB VRAM or more), so it is not an option many people can use and DS4F is so cheap right now, if you are open to externally hosted models.
Almost exactly the same story here. I've also had little to no refusals from DeepSeek, with its Chinese values meaning substantially less friction when it comes to things like reverse engineering, finding copyrighted files, working with dubiously-sourced source code, et cetera. I don't think I'd go back to Copilot even if they dropped prices by 90%.
Yeah, seems like this is in the range of Qwen 3.6, Gemma 4, Nemotron 3 Super, and the like. There are a lot of models, including much smaller cheaper ones (like Qwen 3.6 35B-A3B), that are similarly competitive with Haiku. I can run these on my laptop, I don't need to rent them from Microsoft.
I suppose if you're reeling at the new Copilot bill but want to stay in their ecosystem, this gives you something to use, but for most folks, there's a plethora of better options.
The $20/month ChatGPT plan that comes with codex is good value. Even just having premium ChatGPT is nice. I get rate limited regularly but it still lets me do most things.
The $100/month is excellent value. I don’t understand how’s that not the default option for all professional developers. Unless people don’t produce any value writing code, like playing around and experimenting with vibe coding, I understand. But if software development is your actual income, and assuming you live in a wealthy country, $100/month is nothing for a tool like Codex.
The small stuff has their place. I have this safari extension and needed a way to quickly title people's chat histories. Haiku is the fast cheap thing to come up with decent titles of blocks of text. I feel like there's a bunch of those little things lying around you need a model for. I'm even finding Apple's Foundation Model is super useful for stuff like that. Even summarizing an article. It's like equally awful at doing it, but gets enough done to still be useful as a way to be like "oh yeah, this article is actually worth reading"
Agreed.
Seems like this could have been a nice model if we were still in the old GitHub Copilot free request / premium multiplier mode.
It could have been a good compromise to somehow rein in the costs for Microsoft.
But with Copilot now just being paying per-token prices I don't see how this is competitive with Chinese models.
It is probably telling you can't find the costs in the announcement.
Because
Input $0.75 Cached input $0.075 Output $4.50
might be competitive with Haiku, but nobody in their right mind uses Haiku and Anthropic has abandoned it chasing the tokenmaxers who aren't thinking about budgets.
So I guess they are aiming for corporate customers that are bound to Microsoft through compliance approval, that will soon start seeing their budgets explode, and that will have to find some corporate compromise.
Won’t (presumably) all the market actors converge on similar pricing? If OpenAI stopped operating on subsidies and charged the true costs and their most token hungry customers are the ones that switch to Anthropic and others, then their pricing model switch will also be around the corner.
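To see what those three prices mean for an actual bill, here is a rough sketch. It assumes the quoted figures are USD per 1M tokens (the comment doesn't state the unit), and the session token counts are invented purely for illustration. `session_cost` is a hypothetical helper, not any vendor's API.

```python
# Prices quoted above; the per-1M-token unit is an assumption.
PRICE_PER_M = {"input": 0.75, "cached_input": 0.075, "output": 4.50}


def session_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one session under the assumed per-1M-token prices."""
    usage = {
        "input": input_tokens,
        "cached_input": cached_input_tokens,
        "output": output_tokens,
    }
    return sum(PRICE_PER_M[kind] * count / 1_000_000 for kind, count in usage.items())


# Hypothetical agentic session: 20M prompt tokens in total, 0.5M generated.
with_cache = session_cost(2_000_000, 18_000_000, 500_000)  # 90% of the prompt served from cache
no_cache = session_cost(20_000_000, 0, 500_000)            # nothing served from cache

print(f"with cache: ${with_cache:.2f}   without cache: ${no_cache:.2f}")
```

With these made-up numbers the cache hit rate matters more than the headline input price, which is the practical reason per-token plans reward keeping the cache hot.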
Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?
Anthropic & co charge API users much more, not least to demolish the middlemen low-effort plays like Cursor and Copilot. To not own the model is not viable in 2026.
Haiku does quite well if given a detailed plan. That means much more detail than you otherwise would, but you can still save over e.g. having Opus or Sonnet do everything by having them expand their initial plans into more specific levels of detail and feed it to Haiku (or similar level models).
I personally wouldn't use models of that class directly, though - I'd use them in a harness as a "backend" for more capable models. And Haiku itself, as opposed to other smaller models, is still expensive.
If you use claude-code, Haiku is used under the hood for certain tasks. I'm not sure what they are, but there's some kind of routing that goes to Haiku automatically.
Makes sense as part of a larger coding workflow, especially if it’s fast. Using a trillion parameter model to figure out how to call a targeted edit tool or generate a commit message is a waste. Also narrow tasks like “make the background darker” or “rename this function and update callers”
I've been having really good results with DeepSeek-v4-flash, qwen-3.6-moe, and the older gemini-3-flash-preview. (recent geminis suck hard)
Small models are more than enough for the majority of tasks these days. Plan and review with the bigger ones, let the little ones explore and implement.
OpenCode Go is $10/month for the open weight models with nice quotas: https://opencode.ai/go
You don’t have to limit yourself to the tiny models with the OpenCode Go plan, you can get a lot of usage from the bigger models if you keep the cache hot.
I am about 85% through my quota with 9 days left before refresh and have just used over 1B tokens, mostly DeepSeek V4 Pro, but also a little mimo 2.5 pro and kimi k2.6
> "GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs"
AI is expensive and it has been heavily subsidized. If you compare $20/mo flat for Codex/Claude with a more usage-based model, you're in for a shock. Especially once these companies go public and have to meet investor expectations.
> They are cute but for serious coding they tend to waste your expensive time.
90% of corporate job tasks are trivial enough that Haiku can handle them.
Just this morning I have been implementing reprint functionality in our warehouse management system, which needed to reprint carrier labels and delivery notes for a specific order.
It essentially had to do the same workflow of print, but instead of generating and uploading the pdfs, it only had to fetch and print them.
Took Opus 4.8 high 24m1s and 87k tokens. Took Haiku 6m30s and half the tokens.
So not really sure what you mean by "wasting your expensive time" here. I think you really don't experiment with these tools and assume higher effort, bigger model => time saved, but that's true only when tasks are much bigger and complex enough that a smaller/less precise model would fail or land work of much lower quality.
For starters I did experiment a heck of a lot with models since Github Copilot gave me access to OpenAI, Gemini and Anthropic models. So I probably experimented more than the average LLMer. When GitHub Copilot had a generous quota I ran the same tasks with many models to compare them (and pursue the best solution among them) quite often.
Now about my experience with Haiku, I think it was free for some time in GitHub Copilot, then it was 0.33x quota usage (when Sonnet was 1x and Opus was 3x, good times). I tried to use it for light coding for about a week.
In my tests I concluded that there was zero reason to use 0.33x priced Haiku in my coding workload because it constantly generated subpar solutions. Even when they worked, Sonnet at 1x and Opus at 3x quota usage had a lot less tech debt on average and my plan permitted continuous Sonnet/Opus usage for my workload, otherwise I would use Gemini Flash (the old one, not this 3.5 one) which was better than Haiku by a mile.
Then GPT 5.4 came at 1x quota usage and it was competitive with Opus at 3x quota usage. So I stopped using Opus in favor of GPT and by this time there was even less reason to use Haiku on my $39/mo GitHub Copilot plan.
And now we have DeepSeek v4 which is Sonnet+ levels in my tests because it has an actual 1 million token context window and their crazy alien caching tech (https://huggingface.co/blog/deepseekv4).
I urge you to throw $5 at OpenCode Go plan for 30 days and toy around with DeepSeek Flash on high setting (not max).
Or MiMo 2.5 Pro on the same OpenCode Go plan. 2 amazing models.
I really hope one day there is something like Opus 4.8 but with Cerebras' speed -- they reach over 1,000t/s on gpt-oss-120b but that model is seemingly not even properly trained for tool calling. But watching it slam out several entire screens of thinking/reasoning per second is amazing. I'd love that with Opus quality.
I like gpt oss - great model even if not too smart. It runs on my laptop at over 100 t/s and has a certain tone that I like over all these qwens stuck up their asses.
Does anyone actually use these smaller models for coding? If so, how? I usually Opus everything. Is the play to plan/design/architect with a heavier model then delegate structured tasks to these smaller ones? Would appreciate hearing someone's opinion on having done and tested both paths.
Yes. Divide execution of a change into separate responsibilities. Designate the main chat as the "orchestrator", Opus. You designate a goal, then tell it to grind until it gets there using the following sub-agents in sequence:
1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator
2. Review (Opus): Scrutinize the previous step's work for errors, fidelity to the instructions, fix those and record opportunities to improve the agent configuration + tools to reduce errors and token usage (record those to a file).
3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.
Repeat: Until orchestrator session token budget exhausted (set it to 1M or whatever).
The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated, so to the extent Opus mostly reviews rather than drives that saves a lot too. Self-improvement steps are very expensive but the improvements compound, if you're going to run a job for days or weeks it's way more expensive not to do them.
Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
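The loop described above can be sketched roughly like this. It only illustrates the control flow (execute, review, self-improve, repeat until the token budget is spent); the `Orchestrator` class, its callables, and the stub token counts are all hypothetical stand-ins for real sub-agent calls, not Claude Code's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Orchestrator:
    budget_tokens: int
    execute: Callable[[str], int]                   # step execution; returns tokens used
    review: Callable[[str], tuple[int, list[str]]]  # returns (tokens used, improvement notes)
    improve: Callable[[list[str]], int]             # self-improvement; returns tokens used
    used: int = 0
    backlog: list[str] = field(default_factory=list)

    def run(self, goal: str) -> int:
        """Repeat execute -> review -> improve until the budget is exhausted."""
        rounds = 0
        # The budget is only checked between rounds, so the last round may overshoot.
        while self.used < self.budget_tokens:
            self.used += self.execute(goal)
            cost, notes = self.review(goal)
            self.used += cost
            self.backlog.extend(notes)  # the "record those to a file" step
            self.used += self.improve(self.backlog)
            self.backlog.clear()
            rounds += 1
        return rounds


# Stub sub-agents with made-up token costs, standing in for real model calls.
orch = Orchestrator(
    budget_tokens=1_000_000,
    execute=lambda goal: 100_000,
    review=lambda goal: (20_000, ["tighten the step prompt"]),
    improve=lambda notes: 30_000,
)
rounds = orch.run("migrate the billing module")
print(rounds, orch.used)
```

A real version would likely check the budget before each sub-agent call and persist the improvement notes between sessions.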
I use Gemini 3 Flash, I've seen the Claude Code setups, bullish on Anthropic people are driving up tokens but I am able to produce outcomes with a fraction of the money.
Do you mind sharing your workflow? What do you mean by fraction of the money, in my case personally, I'm yet to reach a session limit on the subscription plan. I'm not "tokenmaxxing" as they say, so hard to see a scenario in which the plan is expensive for the value I get.
Unless you are token rich, you'll have to find a way pretty soon.
For tasks (like kubernetes, linux, reports, database exploration and such) I use GLM5.1. Faster is actually smarter in those cases. And much cheaper too.
Opus 4.8 is for the unknown. Things I don't know how to do myself.
Because the Haiku model is quite cheap but doesn't screw up too often I used it for interactive coding for my existing projects on the older copilot plans.
For simple features I don't have a full plan worked out. I write a bit of code then tell the model in a short one-line prompt what it should do. Sometimes I put temporary comments in the code to give it guidance. Generally if the code change is within a file or package, Haiku is good enough to follow what you ask and not mess up too much. I also have skills created over time to give it guidance. There were some months when I used GitHub copilot where I had excess credits available at the end of the month that I frantically tried to use up.
Even the AI code completions can be pretty good on their own. Sometimes I write some temporary comments describing what the code should do and just press Tab-Tab-Tab and the entire function is done.
I think there is a tendency for people to go for the advanced models thinking they will screw up less, but if you really understand the code it's easier to interactively do it with a lesser model.
Claude code itself spins a lot of its subagents with Haiku. The model has low hallucination rate, so it is great for exploration tasks. I guess this is what the best purpose of this model here will be as well. Which is a lot of tokens - many tasks spin multiple exploration agents before the planning or fixing, that is then just a few tool calls.
I was wondering the same. I guess it makes sense to use a heavy weight model to make the entire design and split the work so that smaller models (possibly local one?) would then do the coding... But how would I even do that? I'm using Claude Code. Would I need support for this within the harness ?
From my experience, smaller models like Haiku 4.5 have indeed shown very convincing results on specific, scoped tasks (themselves generated by a more capable model such as Opus 4.6). We use this kind of workflow in production to optimize speed, efficiency, and costs.
i used to use opus for everything, thats not an option once you move to a multi agent system unless you're working on like high end research. I could easily spend 3k a day if i was using opus as just a normal dev.
As we build a better and better harness and better feedback/verifiers we're switching more to 3.5 flash. I think chinese models would work too, but we cant use those atm.
Generally theres a coordinator running opus and an ever growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator opus.
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
I keep trying to, because I really want to make qwen 3.6 35b work for end implementation of a fleshed out spec (mostly for local data privacy reasons).
...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
I actually find planning/design easier with a smaller model and implementation with a larger one. I'm mostly manually working with the model on planning and design, the decisions are mine, and smaller models are faster. And when there's a clear design/way forward, the bigger models are usually better at understanding the overall context and applying the specific patch they were assigned to. I call it the 1-2 punch system where you do the first light punch then the harder punch when it's actually important to hit properly. I know it goes against the standard of throwing the biggest model at design but I personally experience the bigger models trying to do TOO MUCH and taking a lot of time, which is not good in the design/arch/boilerplate phase.
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup.
Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?
From what I understand, it's because unlike the other models, MAI models haven't yet been fine-tuned against the synthetic datasets specifically designed to boost the benchmark scores.
Hard to know when they don't give the price per token. Presumably it will be comparable to a low-mid range model in terms of price. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used, that's an implementation detail to me. I care about price / performance / latency.
Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.
Please test your websites in Safari. Almost all of your iOS users use it by default, and the desktop experience is pretty close to the mobile experience, so testing is easy.
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
Tomorrow NVIDIA will publish Nemotron 3 Ultra, which will be the biggest open weights LLM from a US company (550B parameters).
The early testers have confirmed that it is much better than all earlier US open weights models, but it is not as good as the best Chinese open weights models.
While Nemotron 3 Ultra is not the smartest open weights LLM, it is well optimized for fast inference, so it is much faster than the other LLMs of the same size.
In any case I believe that it is very good to have an additional option in big open weights LLMs, because until now all existing models have shown that even if some model is definitely better on average than another, the weaker model can still be better in some particular applications.
With open weights models, you can afford to try multiple LLMs for the more important tasks and then choose the best solution.
NVIDIA seem to be following a smart Intel-like strategy of selling chips and also creating software that helps create demand for those chips. With Intel it was things like MKL, IPP, OpenCV etc, and with NVIDIA it is not just CUDA and development libraries but also models like Nemotron.
The pure-AI companies like OpenAI and Anthropic are hoping to sell you API access to cloud-based AI, perhaps running on NVIDIA chips, but it seems NVIDIA's plan may be for you to run local AI, maybe from NVIDIA, running on local NVIDIA chips.
Do you have any insight into the actual technical details that make this sort of thing possible? I want to learn more about model architectures. Does it have to do with attention mechanisms or sparsity or something?
I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.
The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product"
If only they had launched that yesterday I might have avoided Copilot auto model selection using a 9x model, quietly burning my monthly quota in a single afternoon.
To understand Microsoft's AI problems right now, observe that NONE of the models announced are available for use even in the Microsoft Foundry, which is the place where you add models to your account.
I understand the GitHub Copilot rollout takes time, but why can't we consume the models via Microsoft's own API after launching?
Anthropic models are available at foundry the same moment they are launched, but not Microsoft's own models.
To understand Microsoft's AI problems right now, observe the parent comment. It is literally false [1] but somehow creates a whole story of Microsoft ineptitude.
Very nice. But nowhere to be seen in my model list on github copilot enterprise ai settings? I suppose it's still rolling out. The "rolling out to github copilot" is verbatim on the blog post, not my words.
On the other hand, opus 4.8 became immediately available at copilot and foundry when launched.
Mai-voice-2 and mai-transcribe are now available for me on foundry though. Just half a day after launching.
Hear me out: i love microsoft. It's sad to see this state of AI business.
I personally do not like Microsoft, but congrats to them on releasing this model.
While the scores are not good compared to other open-weight models, the important thing to note is that their training data (as they claim) is very clean, without any synthetic datasets.
Mark Zuckerberg must be in crisis. Microsoft releasing models that compete with Claude's models. Meanwhile the only thing anyone knows about Mark's models is that they help you get hacked more easily.
i have had good results adding muse spark's contemplate mode as a roundtabler for complex questions. but you cant turn off their data ingestion for training so that is a shame.
Wait… I think he has moltbook IP as well that he can scale up.
Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?
I don't understand his plan. If I were him I'd either have just gone all in on making RAM, which would become very lucrative, or would have focused on building programming models. They've built some key open source technologies, but it's as if Mark Zuckerberg cannot run anything that isn't a social media company / project.
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
What's with the lack of Microsoft design language on the website? It's painfully obvious they're trying to emulate Anthropic's style here and it looks tacky.
Brand guidelines and web design pretty much don't exist any more as far as I can tell. Gotta get it out yesterday, and the only way to do that is vibe coding, styling be damned.
That's neither Microsoft nor Anthropic design. It's from their acquisition of Inflection AI. Even Copilot mobile app design is basically what was Inflection's design
I've always wondered where Consumer CoPilot's design language was from.
If you watch the Build keynote with Satya, you'll notice that the design of the slides changed to Serif typography and warmer colors when Mustafa/Microsoft AI segment came on which was completely different from the rest of the keynote. Now it makes sense!
Curious how this handles token cost visibility. One of the biggest pain points with AI coding tools right now is having no idea what you're actually spending per project.
I'd really like to get back to an autocomplete flow, ideally with some shared and optimized context with the relationship with my larger agent models.
But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?
You aren't wrong, the field is moving to a world where we do less in the code editor, so autocomplete is not needed any more. I've only manually edited code a few times in the last month. Haven't used autocomplete in 6+ months since I left Copilot to build my own agent harness (I'm now mainly using OpenCode)
"Clean data" is impossible. Language models have polluted the landscape to such a degree it's impossible to filter them out now. OpenAI has no doubt discarded or muddled their dataset that was used to train the original ChatGPT, so there may be no dataset in existence now that isn't contaminated.
I went to VSC specifically to avoid the pricing I started experiencing on Cursor. After this change I have no reason to stick with GH Copilot, I'd rather keep buying OR credits.
Is 51% good enough to reliably use? There's no world in which I use an AI agent where it gets even 15% of the code wrong, that's as bad as Tesla FSD where you need to pay attention to the road while engaging FSD. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex, I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
51% does not mean it randomly gets things wrong half the time.
These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
They're comparing to Haiku, not Opus. Haiku is currently at 4.5.
Even if it were Opus, comparing to a version number makes for an interesting snapshot of time comparison: if you knew how a model performed at whatever time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA" - which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up)
Huh, according to that model card this is a 137B total parameter model.
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro
- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.
Dave Citron here, from the MAI team. Thanks for the feedback, we're getting the model card updated to call out 5B active parameters (137B total).
On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2% which we see as a pretty big leap. But going forward, we'll include additional models in our benchmarks, including models like Qwen 3.6 and Gemma 4.
Have you run it through DeepSWE? I understand that's probably a high ask for this class of model, but would be interesting to see regardless.
Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/
Hey Dave, I’d love to add your new model in the harness I’m going to opensource very soonish. Going to publish benchmarks on real world tasks.
Qwen HAS to be a part of the discussion here, even though Microsoft is a US based entity. Their 30b MoE models absolutely hit way above their weight when paired with the right harness program, and can be run on "Costco gaming computer" specs when configured correctly in llama.cpp.
Sorry Trump Administration, but while the US has been downloading more ram by throwing data centers at everything and burning up everyone's power and water, China has come out with what's effectively a prototype edge compute capable AI model - regardless of how they built it. And arguably I can tokenmaxx on it just fine at around 30-40 tokens/sec.
And also, ASICs are on the way. Imagine one of those with a heavy hitting model (MoE or otherwise, Qwen or otherwise) installed in a PCIe slot at 10k+ tokens/sec and 75 watts max (maximum wattage deliverable by the PCIe slot alone) for $300-400 USD each.
https://taalas.com/the-path-to-ubiquitous-ai/
ASIC demo here: https://chatjimmy.ai/
Sorry/not sorry to rip this whole thing to shreds. But I'm sick and tired of these inefficient LLMs being produced that seemingly can only be offered by subscription from a data center, when I'm running a full AI stack right now (model and all) on my computer at home on a 750 watt max power supply. Microsoft really needs to get with the picture here and compete more with Qwen instead of just the US/EU entities.
Sincerely, your neighbor down in Tacoma. https://www.youtube.com/watch?v=V9jlo4Ht2YA&t=229s
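To put the efficiency argument above in numbers, here is a rough sketch that converts throughput and power into tokens per joule. The 10k tokens/sec at 75 W figure is the imagined ASIC card from the comment, not a measured product. For the home rig, 35 tokens/sec is the midpoint of the quoted 30-40 range, and 750 W is the power supply's maximum rating, so actual draw (and therefore actual efficiency) is an assumption on the pessimistic side.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: tokens generated per joule (1 W = 1 J/s)."""
    return tokens_per_second / watts

# Imagined PCIe ASIC card from the comment: 10k tokens/sec at the 75 W slot limit.
asic = tokens_per_joule(10_000, 75)

# Home rig: midpoint of 30-40 tokens/sec, using the 750 W PSU rating as a
# worst-case stand-in for real power draw (assumption, not a measurement).
home = tokens_per_joule(35, 750)

print(f"ASIC card: {asic:.1f} tokens/J")
print(f"Home rig (worst case): {home:.4f} tokens/J")
print(f"Ratio: {asic / home:.0f}x")
```

Real draw on the home machine is likely well under the PSU rating, so the actual gap would be smaller than this worst-case ratio suggests.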
Qwen is definitely the model to beat as of Mid 2026. While I didn't benchmark with SWE as my use cases are OpenClaw [1]. I found both Qwen 3.6 35B A3B and more impressively Qwen 3.5 122B A10B starting to be competitive with closed flash models. The NVFP4 quant of the latter is what I'm running now on DGX.
[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...
How does qwen compare to deepseek or kimi? I haven't spent much time with qwen but I find deepseek to be mostly comparable to opus for my pet projects. Kimi k2.6 did a lot of stupid stuff and talked to itself a lot "let me do X... Wait, X doesn't make sense because the user explicitly said Y"
Deepseek seems to seek first to understand before going off.
The take away is that this model is a smaller model that competes with Haiku, I would hope they come out with a "Sonnet" competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot, maybe it was part of their deal with OpenAI? Not sure.
Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.
Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.
Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.
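A quick sanity check of the size comparison above, as a sketch. The parameter counts are the ones quoted in this thread (137B total / 5B active for MAI-Code-1-Flash, 35B total / 3B active for Qwen3.6-35B-A3B); the helper name `percent_smaller` is just for illustration.

```python
def percent_smaller(small: float, large: float) -> float:
    """Return how much smaller `small` is than `large`, as a percentage."""
    return (1 - small / large) * 100

# Parameter counts in billions, as quoted in the thread.
MAI_TOTAL_B, MAI_ACTIVE_B = 137, 5    # MAI-Code-1-Flash (137B-A5B)
QWEN_TOTAL_B, QWEN_ACTIVE_B = 35, 3   # Qwen3.6-35B-A3B

total_vs_total = percent_smaller(QWEN_TOTAL_B, MAI_TOTAL_B)
active_vs_active = percent_smaller(QWEN_ACTIVE_B, MAI_ACTIVE_B)
active_vs_total = percent_smaller(QWEN_ACTIVE_B, MAI_TOTAL_B)

print(f"Qwen total vs MAI total:   {total_vs_total:.1f}% smaller")
print(f"Qwen active vs MAI active: {active_vs_active:.1f}% smaller")
print(f"Qwen active vs MAI total:  {active_vs_total:.1f}% smaller")
```

On these numbers the "75% smaller" figure matches total vs. total parameters, while a figure near 98% only appears when Qwen's 3B active parameters are set against MAI's 137B total; active vs. active comes out at 40%.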
They did release MAI-Thinking-1 to compete with Sonnet. Totally not sure why that isn't at the top here.
> 137B-A5B
Yeah, not a 5B param model as the earlier title implied!
So what other models use less than half of Haiku's tokens while providing higher success rate?
Why is Haiku the benchmark though, with code generation don't we primarily care about the quality of the code - not the speed or efficiency at which it's generated?
While I agree directionally, I'll caveat that "cost per token" != "cost per task". In the case of Qwen3.6 it tends to think 1.6x more than Haiku, so the cost of Haiku on the same tasks tends to only be about double. More detail from comparing their Artificial Analysis metrics:
How did you get that nicely formatted graph and table in your post ?!
It's a start and I welcome competition but I don't think I ever used small cloud models like Haiku 4.5. They are cute but for serious coding they tend to waste your expensive time.
And this certainly won't bring me back to GitHub Copilot which I cancelled yesterday.
GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs: https://www.reddit.com/r/GithubCopilot
I have since changed to DeepSeek Flash on high which is Sonnet+ level for almost free.
If I feel I still need smarter models I might signup for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.
I use larger models to organize work into a topologically sorted task graph and pin smaller models to the tasks depending on the complexity with a larger model evaluating the work and patching where necessary. This uses haiku quite often for routine work. I’m able to do multi hour highly complex work with superior results and a much lower bill as a result by doing this, with a parent orchestrator able to do a massive labor within a single context window by effectively organizing work and reviewing quality and integrating where needed. I don’t use haiku directly, but it’s often 30-40% of any major efforts token use. This further improves time to completion as well as cost - but I find haiku is better at following literal instructions and plans without “second guessing,” while opus class models second guess in their thinking constantly.
As such, haiku isn’t a waste of my time, it saves enormous amounts of time for me. But I spent a large amount of time building the orchestration system up front and iterating on it to get here. Interestingly i found my experience as a director and later a distinguished engineer gave me the tools to build it and get it working well and reliably end to end - the dynamics of multi agent workflows of varying capability is not a lot different than the dynamics of a 1000 engineer organization.
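For anyone wondering what "a topologically sorted task graph with models pinned by complexity" could look like, here is a minimal sketch using Python's standard `graphlib`. It is not the commenter's system: the task names, the complexity scores, the threshold, and the `small-model` / `large-model` labels are all hypothetical, and the review-and-patch step by a larger model is left out.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: task -> set of tasks it depends on.
TASKS = {
    "design_schema": set(),
    "write_migration": {"design_schema"},
    "update_api": {"design_schema"},
    "write_tests": {"write_migration", "update_api"},
}

# Hypothetical complexity scores (1 = routine, 5 = hard), e.g. set by a planner model.
COMPLEXITY = {
    "design_schema": 5,
    "write_migration": 2,
    "update_api": 3,
    "write_tests": 1,
}

def pick_model(complexity: int) -> str:
    """Pin routine tasks to a small model and harder ones to a large model."""
    return "small-model" if complexity <= 2 else "large-model"

def plan(tasks: dict, complexity: dict) -> list:
    """Return (task, model) pairs in an order that respects dependencies."""
    order = TopologicalSorter(tasks).static_order()
    return [(task, pick_model(complexity[task])) for task in order]

schedule = plan(TASKS, COMPLEXITY)
for task, model in schedule:
    print(f"{task}: {model}")
```

The point of sorting first is that every task only runs once its prerequisites are done, so a cheap model can follow a literal plan without needing the whole project in context.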
Everyone does that. But I don't find Haiku useful for actual coding tasks. Good to, ehm, generate commit messages and summaries.
In my tests, openweight Qwens and GLM are way better than it.
Got anything from your orchestrator you could share that’s usable by others? Sounds like how I’d like to work but is difficult to get going from scratch
I've been doing benchmarking of various models for finding hard security bugs, and my faith in Haiku (and Sonnet, even) has dropped precipitously in the process. Self-hosted Qwen 3.6 27B consistently outperforms both for finding security bugs, which was a shocking result. I expected Qwen to be around Haiku level, maybe a little worse, and I definitely expected it to be worse than Sonnet.
And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.
There's seemingly no reason to ever use Haiku or Sonnet, if you're not getting it for free or as part of a subscription (that you don't usually saturate).
I don't think that's what these small models are for. They are for things like text summarization and generating a title for your AI session. Maybe Haiku occupies a weird zone where it's overpowered for those tasks but underpowered for anything more sophisticated. But for example I used it on an agentic reasoning task recently (reading a chunk of information and drawing a written conclusion, not writing code) and it did just fine. More powerful model would have been a waste of money.
I don't suppose you've had a chance to benchmark MiniMax V3 yet? I've only just started testing other models after being an Anthropic fan. I haven't put MiniMax V3 to coding tasks yet, but something about my early simple tests has impressed me. The MiniMax API pricing is about 7% of Anthropic API prices (about matching Anthropic's subscription pricing).
Same opinion. Opus is best for coding, but Qwen 3.6 27b Q8 is next, before Sonnet.
Sonnet might have more knowledge and is maybe good for making excel sheets, but it does not write good code and does not follow instructions well.
But 27b Q8 needs a very beefy PC (48GB VRAM or more), so it is not an option many people can use and DS4F is so cheap right now, if you are open to externally hosted models.
DeepSeek competes with Sonnet, not significantly worse or better. It tends to do weird things in codebases on the bigger side.
Almost exactly the same story here. I've also had little to no refusals from DeepSeek, with its Chinese values meaning substantially less friction when it comes to things like reverse engineering, finding copyrighted files, working with dubiously-sourced source code, et cetera. I don't think I'd go back to Copilot even if they dropped prices by 90%.
Are you purchasing directly from DeepSeek? Any concerns as far as privacy or data protection?
Yeah, seems like this is in the range of Qwen 3.6, Gemma 4, Nemotron 3 Super, and the like. There are a lot of models, including much smaller cheaper ones (like Qwen 3.6 35B-A3B), that are similarly competitive with Haiku. I can run these on my laptop, I don't need to rent them from Microsoft.
I suppose if you're reeling at the new Copilot bill but want to stay in their ecosystem, this gives you something to use, but for most folks, there's a plethora of better options.
The $20/month ChatGPT plan that comes with codex is good value. Even just having premium ChatGPT is nice. I get rate limited regularly but it still lets me do most things.
The $100/month is excellent value. I don’t understand how’s that not the default option for all professional developers. Unless people don’t produce any value writing code, like playing around and experimenting with vibe coding, I understand. But if software development is your actual income, and assuming you live in a wealthy country, $100/month is nothing for a tool like Codex.
The small stuff has their place. I have this safari extension and needed a way to quickly title people's chat histories. Haiku is the fast cheap thing to come up with decent titles of blocks of text. I feel like there's a bunch of those little things lying around you need a model for. I'm even finding Apple's Foundation Model is super useful for stuff like that. Even summarizing an article. It's like equally awful at doing it, but gets enough done to still be useful as a way to be like "oh yeah, this article is actually worth reading"
Small models are super useful. But I'm skeptical of their use for coding in particular, which is what this model is advertised for.
Agreed. Seems like this could have been a nice model if we were still in the old GitHub Copilot free request/premium multiplier mode. It could have been a good compromise to somehow rein in the costs for Microsoft.
But with Copilot now just being paying per-token prices I don't see how this is competitive with Chinese models.
It is probably telling you can't find the costs in the announcement. Because Input $0.75 Cached input $0.075 Output $4.50 might be competitive with Haiku, but nobody in their right mind uses Haiku and Anthropic has abandoned it chasing the tokenmaxers who aren't thinking about budgets.
So I guess they are aiming for corporate customers that are bound to Microsoft through compliance approval that will soon start seeing their budgets explode that have to find some corporate compromise.
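To make the quoted prices concrete, here is a rough per-request cost sketch. It assumes the figures above (Input $0.75, Cached input $0.075, Output $4.50) are USD per 1M tokens, which the comment does not state explicitly, and the token counts in the example are made up for illustration.

```python
# Prices quoted in the comment above; ASSUMED to be USD per 1M tokens.
PRICE_PER_M = {"input": 0.75, "cached_input": 0.075, "output": 4.50}

def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, splitting input into uncached and cached tokens."""
    return (
        input_tokens * PRICE_PER_M["input"]
        + cached_tokens * PRICE_PER_M["cached_input"]
        + output_tokens * PRICE_PER_M["output"]
    ) / 1_000_000

# Hypothetical agent session: 200k fresh input, 800k cached input, 50k output tokens.
cost = request_cost(input_tokens=200_000, cached_tokens=800_000, output_tokens=50_000)
print(f"${cost:.3f}")
```

With these made-up counts, output tokens are the largest line item despite being the smallest count, which is why how much a model "thinks" matters as much as its list price.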
Won’t (presumably) all the market actors converge on similar pricing? If OpenAI stopped operating on subsidies and charge the true costs and their most token hungry customers are the ones that switch to Anthropic and others, then their pricing model switch will also be around the corner.
Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?
Anthropic & co charge API users much more, not least to demolish the middlemen low-effort plays like Cursor and Copilot. To not own the model is not viable in 2026.
Haiku does quite well if given a detailed plan. That means much more detail than you otherwise would, but you can still save over e.g. having Opus or Sonnet do everything by having them expand their initial plans into more specific levels of detail and feed it to Haiku (or similar level models).
I personally wouldn't use models that class directly, though - I'd use them in a harness as a "backend" for more capable models. And Haiku itself, as opposed to other smaller models, is still expensive.
If you use claude-code Haiku is used under the hood for certain task. I'm not sure what it is, but there's some kind of routing that goes to Haiku automatically.
Makes sense as part of a larger coding workflow, especially if it’s fast. Using a trillion parameter model to figure out how to call a targeted edit tool or generate a commit message is a waste. Also narrow tasks like “make the background darker” or “rename this function and update callers”
> “rename this function and update callers”
I'm old enough to remember when IDEs could do this without needing a couple gigabytes of matrices to do it
(LLMs are great for anything even slightly more complicated ofc)
I've been having really good results with DeepSeek-v4-flash, qwen-3.6-moe, and the older gemini-3-flash-preview. (recent geminis suck hard)
Small models are more than enough for the majority of tasks these days. Plan and review with the bigger ones, let the little ones explore and implement.
OpenCode Go is $10/month for the open weight models with nice quotas: https://opencode.ai/go
You don’t have to limit yourself to the tiny models with the OpenCode Go plan, you can get a lot of usage from the bigger models if you keep the cache hot.
I am about 85% through my quota with 9 days left before refresh and have just used over 1B tokens, mostly DeepSeek V4 Pro, but also a little mimo 2.5 pro and kimi k2.6
What application/UI are you using deep seek flash high on? Still copilot or something else
> "GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs"
AI is expensive and it has been heavily subsidized. If you think the $20/mo flat rate for Codex/Claude will hold up against a more usage-based model, you're in for a shock. Especially once these companies go public and have to meet investor expectations.
> They are cute but for serious coding they tend to waste your expensive time.
90% of corporate job tasks are trivial enough that Haiku can handle them.
Just this morning I have been implementing a reprint functionality in our warehouse management system, which needed to print again carrier labels and delivery notes for a specific order.
It essentially had to do the same workflow of print, but instead of generating and uploading the pdfs, it only had to fetch and print them.
Opus 4.8 on high took 24m1s and 87k tokens. Haiku took 6m30s and half the tokens.
So not really sure what you mean by "wasting your expensive time" here. I think you really don't experiment with these tools and assume higher effort, bigger model => time saved, but that's true only when tasks are much bigger and complex enough that a smaller/less precise model would fail or land work of much lower quality.
Unfortunately there's no defending Haiku 4.5 at this point when cheaper and better options are available.
TLDR:
https://artificialanalysis.ai/models?models=gemini-3-5-flash...
and: https://i.imgur.com/nTu3VCZ.png
For starters I did experiment a heck lot with models since Github Copilot gave me access to OpenAI, Gemini and Anthropic models. So I probably experimented more than the average LLMer. When GitHub Copilot had a generous quota I ran the same tasks with many models to compare them (and pursue best solution among them) quite often.
Now about my experience with Haiku, I think it was free for some time in GitHub Copilot, then it was 0.33x quota usage (when Sonnet was 1x and Opus was 3x, good times). I tried to use it for light coding for about a week.
In my tests I concluded that there was zero reason to use 0.33x priced Haiku in my coding workload because it constantly generated subpar solutions. Even when they worked, Sonnet at 1x and Opus at 3x quota usage had a lot less tech debt on average and my plan permitted continuous Sonnet/Opus usage for my workload, otherwise I would use Gemini Flash (the old one, not this 3.5 one) which was better than Haiku by a mile.
Then GPT 5.4 came at 1x quota usage and it was competitive with Opus at 3x quota usage. So I stopped using Opus in favor of GPT and by this time there was even less reason to use Haiku on my $39/mo GitHub Copilot plan.
And now we have DeepSeek v4 which is Sonnet+ levels in my tests because it has an actual 1 million token context window and their crazy alien caching tech (https://huggingface.co/blog/deepseekv4).
I urge you to throw $5 at OpenCode Go plan for 30 days and toy around with DeepSeek Flash on high setting (not max).
Or MiMo 2.5 Pro on the same OpenCode Go plan. 2 amazing models.
I really hope one day there is something like Opus 4.8 but with Cerebras' speed -- they reach over 1,000t/s on gpt-oss-120b but that model is seemingly not even properly trained for tool calling. But watching it slam out several entire screens of thinking/reasoning per second is amazing. I'd love that with Opus quality.
I like gpt oss - great model even if not too smart.. runs on my laptop at over 100ts has a certain tone that I like over all these qwens stuck up their asses.
I wonder when THEY make it illegal to vote with your wallet.
Does anyone actually use these smaller models for coding? If so, how? I usually Opus everything. Is the play to plan/design/architect with a heavier model, then delegate structured tasks to these smaller ones? Would appreciate hearing from someone who has done and tested both paths.
I am using Opus 4.x at work, and these "smaller" (20-80bn, 3-4bn active) models at home. Unfortunately there is no comparison, yet (IMHO anyway).
With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex task it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
Ya when it’s your own token budget on the line the smaller/cheaper models are more attractive.
I’ve used GPT mini quite a bit and it’s decent.
>Is the play to plan/design/architect with a heavier model than delegate structured tasks to these smaller ones?
always has been
claude code has opusplan — uses opus while in plan mode, switches to sonnet for execution.
https://code.claude.com/docs/en/model-config#opusplan-model-...
edit: you can make it work with sonnet for planning, and haiku for execution, or any other combination you fancy to work with.
https://code.claude.com/docs/en/model-config#control-the-mod...
Yes. Divide execution of a change into separate responsibilities. Designate the main chat as the "orchestrator", Opus. You designate a goal, then tell it to grind until it gets there using the following sub-agents in sequence:
1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator
2. Review (Opus): Scrutinize the previous step's work for errors, fidelity to the instructions, fix those and record opportunities to improve the agent configuration + tools to reduce errors and token usage (record those to a file).
3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.
Repeat: Until orchestrator session token budget exhausted (set it to 1M or whatever).
The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated, so to the extent Opus mostly reviews rather than drives that saves a lot too. Self-improvement steps are very expensive but the improvements compound, if you're going to run a job for days or weeks it's way more expensive not to do them.
Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
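The loop described above can be sketched roughly as follows. This is only an illustration of the control flow, not the commenter's actual setup: `run_agent` is a hypothetical stand-in that just counts tokens instead of calling a model, the per-step token costs are invented, and the budget is only checked at the start of each round.

```python
from dataclasses import dataclass, field

@dataclass
class Orchestrator:
    token_budget: int
    tokens_used: int = 0
    improvement_log: list = field(default_factory=list)

    def run_agent(self, role: str, model: str, prompt: str, cost: int) -> str:
        """Hypothetical stand-in for a real sub-agent call; it only tracks tokens."""
        self.tokens_used += cost
        return f"[{model}] {role}: {prompt}"

    def grind(self, goal: str, goal_reached) -> int:
        """Repeat execute -> review -> self-improve until done or out of budget."""
        rounds = 0
        while self.tokens_used < self.token_budget and not goal_reached(rounds):
            # 1. Step execution on the cheaper model (invented cost: 100k tokens).
            work = self.run_agent("execute", "sonnet", goal, cost=100_000)
            # 2. Review on the stronger model; record improvement opportunities.
            review = self.run_agent("review", "opus", work, cost=20_000)
            self.improvement_log.append(review)
            # 3. Self-improvement: act on the most recent recorded item.
            self.run_agent("self-improve", "opus", self.improvement_log[-1], cost=30_000)
            rounds += 1
        return rounds

orch = Orchestrator(token_budget=1_000_000)
# Hypothetical stop condition: pretend the goal is reached after 3 rounds.
rounds = orch.grind("ship the feature", goal_reached=lambda n: n >= 3)
print(rounds, orch.tokens_used)
```

In a real harness the review step would decide whether the goal is met, and the improvement log would be a file the agents read on the next round.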
I use Gemini 3 Flash, I've seen the Claude Code setups, bullish on Anthropic people are driving up tokens but I am able to produce outcomes with a fraction of the money.
Do you mind sharing your workflow? What do you mean by fraction of the money, in my case personally, I'm yet to reach a session limit on the subscription plan. I'm not "tokenmaxxing" as they say, so hard to see a scenario in which the plan is expensive for the value I get.
3 Flash is likely rather underrated here. It continues to impress me on few-shot tasks.
Unless you are token rich, you'll have to find a way pretty soon.
For tasks (like kubernetes, linux, reports, database exploration and such) I use GLM5.1. Faster is actually smarter in those cases. And much cheaper too.
Opus 4.8 is for the unknown. Things I don't know how to do myself.
Because the Haiku model is quite cheap but doesn't screw up too often I used it for interactive coding for my existing projects on the older copilot plans.
For simple features I don't have a full plan worked out. I write a bit of code then tell the model in a short line prompt what it should do. Sometimes I put temporary comments in the code to give it guidance. Generally if the code change is within a file or package, Haiku is good enough to follow what you ask and not mess up too much. I also have skills created over time to give it guidance. There were some months when I used GitHub copilot where I had excess credits available at the end of the month that I frantically tried to use up.
Even the AI code completions can be pretty good on their own. Sometimes I write some temporary comments describing what the code should do and just press Tab-Tab-Tab and the entire function is done.
I think there is a tendency for people to go for the advanced models thinking they will screw up less, but if you really understand the code it's easier to interactively do it with a lesser model.
Claude code itself spins a lot of its subagents with Haiku. The model has low hallucination rate, so it is great for exploration tasks. I guess this is what the best purpose of this model here will be as well. Which is a lot of tokens - many tasks spin multiple exploration agents before the planning or fixing, that is then just a few tool calls.
I was wondering the same. I guess it makes sense to use a heavy weight model to make the entire design and split the work so that smaller models (possibly local one?) would then do the coding... But how would I even do that? I'm using Claude Code. Would I need support for this within the harness ?
Not sure if it's considered small in any way, but DeepSeek V4 Flash is really decent.
From my experience, smaller models like Haiku 4.5 have indeed shown very convincing results on specific, scoped tasks (themselves generated by a more capable model such as Opus 4.6). We use this kind of workflow in production to optimize speed, efficiency, and costs.
i used to use opus for everything, thats not an option once you move to a multi agent system unless you're working on like high end research. I could easily spend 3k a day if i was using opus as just a normal dev.
As we build a better and better harness and better feedback/verifiers we're switching more to 3.5 flash. I think chinese models would work too, but we cant use those atm.
Generally theres a coordinator running opus and an ever growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator opus.
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
Implicitly, yes. A lot of harnesses will invoke small models to do small changes, saving time and tokens.
plan using opus execute using local
I keep trying to, because I really want to make qwen 3.6 35b work for end implementation of a fleshed out spec (mostly for local data privacy reasons).
...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
I use it for smaller changes that I need to make, mainly on UI fixes or some easy logic fixes.
In DeepSWE anything from Anthropic is a whole class lower than what's achievable with gpt-5.5
So by using Opus you are using "smaller" model. Well, not really smaller, just worse. The actual smaller models can at least be faster.
I actually find planning/design easier with a smaller model and implementation with a larger one. I'm mostly manually working with the model on planning and design and decisions are mine and smaller models are faster. And when there's a clear design/way forward, the bigger models are usually better at understanding the overall context and applying the specific patch they were assigned to. I call it the 1-2 punch system where you do the first light punch then the harder punch when it's actually important to hit properly. I know it goes against the standard of throwing the biggest model at design but I personally experience the bigger models try to do TOO MUCH and take a lot of time which is something that's not good in the design/arch/boilerplate phase.
What is with people reimplementing window scrolling badly?
Probably vibe coded. I use StopTheMadness to prevent it.
Immediately noticed that and then closed out.
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup.
Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?
from what I understand, it's because unlike the other models, MAI models haven't yet fine-tuned against the synthetic datasets specifically designed to boost the benchmark scores.
It’s about bang for buck. That high a score for 5B params is pretty good, nigh unbelievable a short while ago.
It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.
Yet another reason the current buildout will feel like the railroads.
It's 5B active params in MoE, not 5B total params (total is 137B).
> It’s about bang for buck.
Hard to know when they don't give the price per token. Presumably it will be comparable to a low-mid range model in terms of price. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used, that's an implementation detail to me. I care about price / performance / latency.
Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.
That's what I'm betting on anyway.
The SOTA models will not shrink, because the problems will get bigger, from "write me a C compiler" to "clone Stripe business and run it".
The introductory blog post has a lot more information
https://microsoft.ai/news/introducingmai-code-1-flash/
and the model card
https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Thanks! I've changed the top link to the blog post and put the other links in the toptext.
They are comparing it to Haiku 4.5. Not Opus, not Sonnet, but Haiku, the smallest Anthropic model, 3 versions old.
4.5 is still the latest Haiku model
Any reason why they are not on openrouter yet, but the speech models are? https://openrouter.ai/models
Curious to test them and see how they perform.
Please test your websites in Safari. Almost all of your iOS users use it by default, and the desktop experience is pretty close to the mobile experience, so testing is easy.
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
some kind of scroll hijack going on for sure, feels terrible on firefox+macos
I instantly close websites which use this weird scroll hijacking and slow animation nonsense.
Let me slide as fast and unrestricted as I want. I do not want to "transition" to the next paragraph.
This trend needs to stop.
So it's trained on the SWE Bench Pro evalset
That's not accurate. Take a look at the paper to see what it is trained on! And specifically decontamination is called out in A.4
https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...
What is your evidence for this claim?
They say hill climbing
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
not open weight or at least I did not find anything indicating open weight
Tomorrow NVIDIA will publish Nemotron 3 Ultra, which will be the biggest open weights LLM from a US company (550B parameters).
The early testers have confirmed that it is much better than all earlier US open weights models, but it is not as good as the best Chinese open weights models.
While Nemotron 3 Ultra is not the smartest open weights LLM, it is well optimized for fast inference, so it is much faster than the other LLMs of the same size.
In any case I believe that it is very good to have an additional option in big open weights LLMs, because until now all existing models have shown that even if some model is definitely better on average than another, the weaker model can still be better in some particular applications.
With open weights models, you can afford to try multiple LLMs for the more important tasks and then choose the best solution.
NVIDIA seem to be following a smart Intel-like strategy of selling chips and also creating software that helps create demand for those chips. With Intel it was things like MKL, IPP, OpenCV etc, and with NVIDIA it is not just CUDA and development libraries but also models like Nemotron.
The pure-AI companies like OpenAI and Anthropic are hoping to sell you API access to cloud-based AI, perhaps running on NVIDIA chips, but it seems NVIDIA's plan may be for you to run local AI, maybe from NVIDIA, running on local NVIDIA chips.
> it is well optimized for fast inference
Do you have any insight into the actual technical details that make this sort of thing possible? I want to learn more about model architectures. Does it have to do with attention mechanisms or sparsity or something?
:(
I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.
The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product"
If only they had launched that yesterday I might have avoided Copilot auto model selection using a 9x model, quietly burning my monthly quota in a single afternoon.
Shouldn't the next model's focus be on system design rather than code?
Seems like the work from a good system design to code is practically solved.
Now it’s a matter of the design of the system. Or is that represented in these evals?
Have you tried system design with LLMs? I find them pretty good at suggesting 5 architectures for a problem and then iterating on the solutions.
Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.
Copilot brand is tarnished, so time to bung everything under MAI?
Maybe the next Windows update will change This PC back to MAI Computer. ;)
Scroll wheel hijacked on this entire domain
Fix:
The fix is to close the tab.
Yeah this website is horrendous to use. What were they thinking?
You mean "what was the LLM thinking?"
I had to remind myself what Haiku is even for. Anthropic hasn't spent a lot of recent marketing on it.
When I need a light model, I reach for Sonnet. It is nearly free on the max plans, and quite fast. I don't see a place for Haiku in regular coding.
Haiku I guess is when you need summarization/categorization at scale.
Microsoft setting Haiku as the benchmark is a low bar.
> "It is nearly free on the max plans"
is a funny oxymoron
I don't see the point in comparing yourself to Haiku which is not only useless for coding but also old. No thanks Microsoft.
To understand Microsoft's AI problems right now, observe that NONE of the models announced are available for use even in Microsoft Foundry, which is the place where you add models to your account.
I understand the GitHub Copilot rollout takes time, but why can't we consume the models via Microsoft's own API after launch?
Anthropic models are available in Foundry the moment they launch, but not Microsoft's own models.
To understand Microsoft's AI problems right now, observe the parent comment. It is literally false [1] but somehow creates a whole story of Microsoft ineptitude.
[1] https://github.blog/changelog/2026-06-02-mai-code-1-flash-is...
Very nice. But it's nowhere to be seen in my model list in the GitHub Copilot Enterprise AI settings. I suppose it's still rolling out. The "rolling out to github copilot" wording is verbatim from the blog post, not my words.
On the other hand, Opus 4.8 became immediately available in Copilot and Foundry when it launched.
Mai-voice-2 and mai-transcribe are now available for me on foundry though. Just half a day after launching.
Hear me out: I love Microsoft. It's sad to see the AI business in this state.
So I guess the important link the marketing department forgot is this one: https://docs.github.com/en/copilot/reference/copilot-billing...
Model               Input   Cached input   Output
MAI-Code-1-Flash    $0.75   $0.075         $4.50
Comparing to
Claude Haiku 4.5    $1.00   $0.10          $5.00
looks fine.
But they also forgot to include the benchmarks comparing to
GPT-5.4 mini        $0.75   $0.075         $4.50
Those would have been helpful.
And as I am on holiday today I will try to help them out:
                     GPT-5.4 mini   Claude Haiku 4.5   MAI-Code-1-Flash
SWE-Bench Pro        54.4%          35.2%              51.2%
Terminal-Bench 2.0   60.0%          41.6%              54.8%
Source: https://openai.com/index/introducing-gpt-5-4-mini-and-nano/
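For anyone who wants to turn the price table above into a per-task figure, here is a minimal Python sketch. It assumes the listed prices are USD per one million tokens (the table itself does not state the unit), and the token counts in the example run are made-up placeholders, not measurements of any of these models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices copied from the table above; ASSUMED to be USD per 1M tokens."""
    input: float
    cached_input: float
    output: float


PRICES = {
    "MAI-Code-1-Flash": Pricing(0.75, 0.075, 4.50),
    "Claude Haiku 4.5": Pricing(1.00, 0.10, 5.00),
    "GPT-5.4 mini": Pricing(0.75, 0.075, 4.50),
}


def task_cost(model: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one task under the per-1M-token assumption."""
    p = PRICES[model]
    total = (
        input_tokens * p.input
        + cached_tokens * p.cached_input
        + output_tokens * p.output
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200k fresh input, 800k cached input, 50k output tokens.
    for name in PRICES:
        print(f"{name}: ${task_cost(name, 200_000, 800_000, 50_000):.4f}")
```

The same token counts are used for every model here, which ignores the fact that different models may need different numbers of tokens to finish the same task.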
>Build for developers, not benchmarks
That sounds like something you say when you don't benchmark well
The technical report is very detailed and would help the 'reinforcement learning' of future researchers. Thanks, Microsoft!
Gemma 4 26B-A4B scored exceptionally well with 20% fewer active params (4B vs. 5B), so this isn't unprecedented.
I personally do not like Microsoft, but congrats to them on releasing this model.
While the scores are not good compared to other open-weight models, the important thing to note is that their training data (as they claim) is very clean, without any synthetic datasets.
Mark Zuckerberg must be in crisis. Microsoft releasing models that compete with Claude's models. Meanwhile the only thing anyone knows about Mark's models is that they help you get hacked more easily.
Meta recently launched Muse Spark [1] and they themselves compare against Claude Opus 4.6 Max.
Here Microsoft is comparing against Claude Haiku, the smallest and least capable model from Anthropic.
[1] https://ai.meta.com/blog/introducing-muse-spark-msl/
I have had good results adding Muse Spark's contemplate mode as a roundtabler for complex questions. But you can't turn off their data ingestion for training, so that is a shame.
Wait… I think he has moltbook IP as well that he can scale up.
Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?
I don't understand his plan. If I were him, I'd either have gone all in on making RAM, which would become very lucrative, or focused on building programming models. They've built some key open source technologies, but it's as if Mark Zuckerberg cannot run anything that isn't a social media company or project.
Related ongoing thread:
MAI-Thinking-1 - https://news.ycombinator.com/item?id=48374362 - June 2026 (64 comments)
"It is built end-to-end by Microsoft using clean and appropriately licensed data."
Well, still no list or publication of the training data.
Is anyone using haiku 4.5?
Why not showcase it against something in a similar domain like qwen3.6 or gemma 4?
You lost me at forced scrolling. Ugh!
From https://news.ycombinator.com/newsguidelines.html
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I’m not sure the message should be benchmarking.
The eye-opener is clean licensed data with filters for AI content (not sure how you do that).
If MSFT builds up using an ethical approach, there is a large anti-AI audience that might take note.
The UI has Mustafa Suleyman written all over it. There seems to be as much effort in rebranding MAI as in training.
It is good to see big companies like Microsoft launching LLMs. They have a large amount of compute power and good scientists to create useful models.
Microsoft has been releasing LLMs for years.
Sort of. Phi models were just trained on GPT outputs though.
And occasionally un-releasing them like with WizardLM.
They were mostly distilled or fine-tuned OAI models.
What's with the lack of Microsoft design language on the website? It's painfully obvious they're trying to emulate Anthropic's style here and it looks tacky.
Definitely vibed microslop, the giveaway is the broken header and scrolling on mobile.
The broken header is an incredible distraction. I can't believe this slipped through.
Brand guidelines and web design pretty much don't exist any more as far as I can tell. Gotta get it out yesterday, and the only way to do that is vibe coding, styling be damned.
That's neither Microsoft nor Anthropic design. It's from their acquisition of Inflection AI. Even the Copilot mobile app's design is basically what was Inflection's design.
I've always wondered where consumer Copilot's design language was from.
If you watch the Build keynote with Satya, you'll notice that the design of the slides changed to serif typography and warmer colors when the Mustafa/Microsoft AI segment came on, which was completely different from the rest of the keynote. Now it makes sense!
Thank you! This website is dreadful for accessibility and usability.
maybe it was coded by Claude
I think it is AI generated.
"It’s not just smarter; it’s leaner"
This is needlessly embarrassing, seems like a small thing, but it makes them look... desperate?
A little too minimalist - only a few hundred words on the entire page!
I'd love to see a tokens per second metric. I always prioritize speed over raw intelligence for flash models.
> I always prioritize speed over raw intelligence for flash models.
This model might have a perfect speed:
Leave it long enough, and it'll print the works of Shakespeare!
Maybe this will replace raptor-mini as the "free" model on copilot plans? (but I don't see it at all yet on the student plan, in vscode or the cli)
How not to flex:
"MAI-Code-1-Flash outperforms Claude Haiku 4.5"
Curious how this handles token cost visibility. One of the biggest pain points with AI coding tools right now is having no idea what you're actually spending per project.
wtf are they doing to the scroll on that page
even worse on mobile
Raw feedback to the team: 1-model looks awesome, 2-The artificially smoothed scrolling on your page feels really bad!
To be clear about the size of the model: MAI-Code-1-Flash is 137B A5B.
In a few languages MAI means no/never, so it's an apt name for a Microsoft offering.
MAI? Ma die, mai!
(gestures wildly while changing lanes in his Fiat 500)
Would be cool if this were an open model.
Claude Haiku 4.5 results with 60% fewer tokens. Sounds good, but they don't list token costs.
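Back-of-the-envelope, using the Copilot billing output prices quoted elsewhere in this thread ($4.50 for MAI-Code-1-Flash vs. $5.00 for Claude Haiku 4.5). This is only a sketch: it assumes both prices are in the same per-token unit and that the "60% fewer tokens" claim applies to billed output tokens, which the quoted claim does not specify.

```python
def relative_output_cost(token_reduction: float, price_new: float, price_baseline: float) -> float:
    """Output cost of the new model as a fraction of the baseline's output cost.

    token_reduction is the claimed fraction of tokens saved (0.60 means 60% fewer).
    """
    tokens_used_fraction = 1.0 - token_reduction
    return tokens_used_fraction * price_new / price_baseline


if __name__ == "__main__":
    # ASSUMPTION: the reduction applies to output tokens only; input costs are ignored.
    ratio = relative_output_cost(0.60, 4.50, 5.00)
    print(f"Relative output cost vs. Haiku 4.5: {ratio:.2f}")
```

Under those assumptions the output spend would be roughly a third of Haiku's for the same result, but input and cached-input tokens are not covered by this estimate.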
"Mai" means "never" in Italian. Ain't gonna happen.
"Build for developers, not benchmarks" is the worst marketing shot I've ever heard.
It claims that, then promptly proceeds to showcase a bunch of benchmarks.
It works if you can prove it, but you know, Microsoft didn't.
Where's the pelican when you need it the most?
Why do websites still hijack scrolling? It sucks
I'd really like to get back to an autocomplete flow, ideally with some optimized context shared with my larger agent models.
But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?
You aren't wrong, the field is moving to a world where we do less in the code editor, so autocomplete is not needed any more. I've only manually edited code a few times in the last month. Haven't used autocomplete in 6+ months since I left Copilot to build my own agent harness (I'm now mainly using OpenCode)
"Clean data" is impossible. Language models have polluted the landscape to such a degree it's impossible to filter them out now. OpenAI has no doubt discarded or muddled their dataset that was used to train the original ChatGPT, so there may be no dataset in existence now that isn't contaminated.
I mean they are comparing themselves to Haiku of all things, geez that's not a good start...
So lame … using haiku 4.5 as comparison
I went to VSC specifically to avoid the pricing I started experiencing on Cursor. After this change I have no reason to stick with GH Copilot, I'd rather keep buying OR credits.
"Build for developers, not benchmarks" Shouldn't that be.. Built?
"superintellegence team"
Why not assign them to make windows good :D
So it's not an open model while not being much better? Meh.
a lot of people got paid way too much for this garbage, enjoy your performance bonuses for taking initiative
Is 51% good enough to reliably use? There's no world in which I use an AI agent where it gets even 15% of the code wrong. That's as bad as Tesla FSD, where you need to pay attention to the road while engaging FSD. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex. I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
Claude Opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.
51% does not mean it randomly gets things wrong half the time.
These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
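A toy illustration of that point, with entirely made-up numbers: two hypothetical models can post the same benchmark average while behaving very differently per task. This is not how any particular benchmark is computed in detail, just a sketch of "average over tasks" versus "reliability on the tasks you choose to hand over".

```python
from statistics import mean

# Hypothetical per-task success probabilities (invented for illustration).
coin_flip_model = [0.5] * 10                 # fails unpredictably on every task
predictable_model = [1.0] * 5 + [0.0] * 5    # always solves 5 tasks, never the other 5


def benchmark_score(per_task_success: list[float]) -> float:
    """Expected first-try pass rate: the average success probability over all tasks."""
    return mean(per_task_success)


def score_on_routed_tasks(per_task_success: list[float], threshold: float = 0.9) -> float:
    """Expected success if the model only gets tasks it is known to handle well."""
    routed = [p for p in per_task_success if p >= threshold]
    return mean(routed) if routed else 0.0


if __name__ == "__main__":
    for name, model in [("coin flip", coin_flip_model), ("predictable", predictable_model)]:
        print(name, benchmark_score(model), score_on_routed_tasks(model))
```

Both toy models average 50%, but only the predictable one becomes dependable once you know which tasks to give it.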
Wow yes bro
how long until they rebrand this shit as copilot?
TL;DR: this is just a Claude Haiku alternative; you can probably skip the whole article.
Please share the script
print(ExpectedOutput)
Please share
Comparing against Claude 4.5? Aren't we up to 4.8? Bit disingenuous?
They're comparing to Haiku, not Opus. Haiku is currently at 4.5.
Even if it were Opus, comparing to a version number makes for an interesting point-in-time comparison: if you knew how a model performed at whatever time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA", which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up.)
Latest Haiku (smallest Anthropic Model) is version 4.5, they haven't released a new version, hence the comparison to that.