https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...
Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)
We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
18 replies →
It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.
So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.
So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
1 reply →
I assume all of the models also have variations on, “how many ‘r’s in strawberry”.
> We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...
10 replies →
If this had any substance then it could be criticized, which is what they're trying to avoid.
1 reply →
but can it recreate the spacejam 1996 website? https://www.spacejam.com/1996/jam.html
in case folks are missing the context
https://news.ycombinator.com/item?id=46183294
That is not a meaningful metric given that we don't live in 1996 and neither do our web standards.
10 replies →
I think this benchmark could be slightly misleading for assessing a coding model. But still a very good result.
Yes, SVG is code, but not in the sense of being executable with verifiable inputs and outputs.
I love that we are earnestly contemplating the merits of the pelican benchmark. What a timeline.
1 reply →
But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
Where did you get the llm tool from?!
He made it: https://github.com/simonw/llm
2 replies →
Skipped the bicycle entirely and upgraded to a sweet motorcycle :)
Looks like a Cybertruck actually!
1 reply →
The Batman motorcycle!
2 replies →
Is it really an svg if it’s just embedded base64 of a jpg
You were seeing the base64 image tag output at the bottom. The SVG input is at the top.
Impressive! I'm really excited to leverage this in my gooning sessions!
Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku 4.5 and Gemini 3 Pro Fast (TBA) and whatever ridiculously-named light model OpenAI offers today (GPT 5.1 Codex Max Extra High Fast?)
The OpenAI thing is named Garlic.
(Surely they won't release it like that, right..?)
TIL: https://garlicmodel.com/
That looks like the next flagship rather than the fast distillation, but thanks for sharing.
6 replies →
No, this is comparable to Deepseek-v3.2 even on their highlight task, with significantly worse general ability. And it's priced at 5x that.
It's open source; the price is up to the provider, and I do not see any on openrouter yet. ~~Given that devstral is much smaller, I can not imagine it will be more expensive, let alone 5x. If anything DeepSeek will be 5x the cost.~~
edit: Mea culpa. I missed the active vs dense difference.
5 replies →
I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them.
It spent about half an hour, correctly identified what the program did, found two small bugs, fixed them, made some minor improvements, and added two new, small but nice features.
It introduced one new bug, but then fixed it on the first try when I pointed it out.
The changes it made to the code were minimal and localized; unlike some more "creative" models, it didn't randomly rewrite stuff it didn't have to.
It's too early to form a conclusion, but so far, it's looking quite competent.
On what hardware did you run it?
FWIW, it’s free through Mistral right now
1 reply →
So I tested the bigger model with my typical standard test queries, which are neither too tough nor too easy. They are also ones you wouldn't find extensive training data for. Finally, I have already used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3.
Here is what I think about the bigger model: It sits between sonnet 4 and sonnet 4.5. Something like "sonnet 4.3". The response speed was pretty good.
Overall, I can see myself shifting to this for regular day-to-day coding if they can offer it at competitive pricing.
I'll still use sonnet 4.5 or gemini 3 for complex queries, but, for everything else code related, this seems to be pretty good.
Congrats Mistral. You have most probably caught up to the big guys. Not quite there yet, but not far now.
Looks interesting, eager to play around with it! Devstral was a neat model when it was released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so it's gonna be interesting to see if Devstral 2 can replace it.
I'm a bit saddened by the name of the CLI tool, which to me implies the intended usage. "Vibe-coding" is a fun exercise for seeing where models go wrong, but for professional work where you need tight control over quality, you obviously can't vibe your way to excellence; hard reviews are required. So: not "vibe coding", which is all about unreviewed code and just going with whatever the LLM outputs.
But regardless of that, it seems like everyone and their mother is aiming to fuel the vibe-coding frenzy. But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it? All the agents seem to focus on handing work off to vibe-coding agents, while what I want is something even more tightly integrated with my tools so I can continue delivering high-quality code I know and control. Where are those tools? None of the existing coding agents apparently aim for this...
Their new CLI agent tool [1] is written in Python unlike similar agents from Anthropic/Google (Typescript/Bun) and OpenAI (Rust). It also appears to have first class ACP support, where ACP is the new protocol from Zed [2].
[1] https://github.com/mistralai/mistral-vibe
[2] https://zed.dev/acp
I did not know A2A had a competitor :(
1 reply →
> Their new CLI agent tool [1] is written in
This is exactly the CLI I'm referring to, whose name implies it's for playing around with "vibe-coding", instead of helping professional developers produce high quality code. It's the opposite of what I and many others are looking for.
1 reply →
>vibe-coding
A surprising amount of programming is building cardboard services or apps that only need to last six months to a year and are then thrown away when temporary business needs change. Execs are constantly clamoring for semi-persistent dashboards and ETL-visualized data that lasts just long enough to rein in the problem and move on to the next fire. Agentic coding is good enough for cardboard services that collapse when they get wet. I wouldn't build an industrial data lake service with it, but you can certainly build cardboard consumers of the data lake.
You are right.
But there is nothing more permanent than a quickly hacked-together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.
> But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it?
Claude Code not good enough for ya?
Claude Code has absolutely zero features that help me review code or do anything other than vibe-code and accept changes as they come in. We need diff comparisons between different executions, a TUI tailored for that kind of work, and more. Claude Code is basically an MVP of that.
Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.
8 replies →
> where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs?
This is what we're building at Brokk: https://brokk.ai/
Quick intro: https://blog.brokk.ai/introducing-lutz-mode/
Did you try Aider?
I did, although a long time ago, so maybe I need to try it again. But it still seems to be stuck in a chat-like interface instead of something tailored to software development. Think IDE but better.
14 replies →
I created a very unprofessional tool, which apparently does what you want!
While True:
0. Context injected automatically. (My repos are small.)
1. I describe a change.
2. LLM proposes a code edit. (Can edit multiple files simultaneously. Only one LLM call required :)
3. I accept/reject the edit.
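In case it helps anyone picture it, here's a minimal Python sketch of that loop (gather_context, propose_edit and apply_edit are hypothetical placeholders for my actual context-injection, LLM-call and patching code):

    def gather_context() -> str: ...                   # 0. inject repo context automatically
    def propose_edit(ctx: str, req: str) -> str: ...   # 2. one LLM call proposes a (multi-file) diff
    def apply_edit(diff: str) -> None: ...             # apply an accepted diff

    while True:
        request = input("Describe a change (empty to quit): ")  # 1. I describe a change
        if not request:
            break
        diff = propose_edit(gather_context(), request)
        print(diff)
        if input("Apply this edit? [y/N] ").lower() == "y":      # 3. accept/reject the edit
            apply_edit(diff)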
High quality code is a thing of the past
What matters is high quality specifications including test cases
> High quality code is a thing of the past
Says the person who will find themselves unable to change the software in even the slightest way without having to do large refactors across everything at the same time.
High quality code matters more than ever, would be my argument. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem is the second you invite it to keep doing that.
1 reply →
"high quality specifications" have _always_ been a thing that matters.
In my mind, it's somewhat orthogonal to code quality.
Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile makes specs and code quality somewhat related, but in at least some ways it probably drives lower-quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.
> run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this
What kind of hardware do you have to be able to run a performant GPT-OSS-120b locally?
RTX Pro 6000, ends up taking ~66GB when running the MXFP4 native quant with llama-server/llama.cpp and max context, as an example. Guess you could do it with two 5090s with slightly less context, or different software aimed at memory usage efficiency.
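For anyone curious, a rough example of the kind of invocation I mean (the gguf filename is illustrative and the context size is the model's 131072 maximum; adjust to your setup):

    llama-server -m gpt-oss-120b-mxfp4.gguf -c 131072 -ngl 99 --port 8080

-c sets the context window, -ngl 99 offloads all layers to the GPU.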
1 reply →
The model is 64GB (int4 native), add 20GB or so for context.
There are many platforms out there that can run it decently.
AMD Strix Halo, Mac platforms, two (or three, without extra RAM) of the new AMD AI Pro R9700 (32GB of VRAM, $1200), multi-consumer-GPU setups, etc.
Mbp 128gb.
I'm sure I'm not the only one that thinks "Vibe CLI" sounds like an unserious tool. I use Claude Code a lot and little of it is what I would consider Vibe Coding.
They're looking for free publicity. "This French company launched a tool that lets you 'vibe' an application into being. Programmers outraged!"
Using LLM's to write code is inherently best for unserious work.
These are the cutting insights I come to HN for.
2 replies →
"Not reviewing generated code" is the problem. Not the LLM generated code.
Agree, but that's just the term for any LLM-assisted development now.
Even the Gemini 3 announcement page had some bit like "best model for vibe coding".
Maybe they are just trying to be funny.
Their chat was called "Le Chat" - it's just their style.
And while it may miss the HN crowd, one of the main selling-points of AI coding is the ease and playfulness.
If you’re letting Claude write code you’re vibe coding
So people have different definitions of the word, but originally Vibe Coding meant "don't even look at the code".
If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)
There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).
4 replies →
The original definition was very different. The main thing with vibe coding is that you don't care about the code. You don't even look at the code. You prompt, test that you got what you wanted, and move on. You can absolutely use cc to vibe code. But you can also use it to ... code based on prompts. Or specs. Or docs. Or whatever else. The difference is if you want / care to look at the code or not.
No, that's not the definition of "vibe coding". Vibe coding is letting the model do whatever without reviewing it and not understanding the architecture. This was the original definition and still is.
It sure doesn't feel like it, given how closely I have to babysit Claude Code; left to its own devices for a minute, it produces code I don't recognize once it's done.
1 reply →
Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?
All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.
I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system. I mean, if I was doing more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.
For grins:
Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in at $5,000 today and, given RAM prices, maybe not actually possible tomorrow.
Max CUDA compatibility, slower t/s? DGX Spark.
Ok with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128GB unified memory; order a Framework Desktop.
Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth, and Mac users seem to be quite happy running locally for just messing around.
You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.
> I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system.
That's a good idea!
Curious about this, if you don't mind sharing:
- what's the stack ? (Do you run like llama.cpp on that rented machine?)
- what model(s) do you run there?
- what's your rough monthly cost? (Does it come up much cheaper than if you called the equivalent paid APIs)
5 replies →
Dual 3090s (24GB each) on 8x+8x PCIe has been a really reliable setup for me (with an NVLink bridge... even though it's relatively low bandwidth compared to Tesla NVLink, it's better than going over PCIe!)
48GB of VRAM and lots of CUDA cores; hard to beat this value atm.
If you want to go even further, you can get an 8x V100 32GB server complete with 512GB of RAM and NVLink switching for $7000 USD from unixsurplus (ebay.com/itm/146589457908), which can run even bigger models and with healthy throughput. You would need 240V power to run that in a home lab environment though.
The V100 is outdated (no bf16, dropped in CUDA 13) and power hungry (8 cards running continuously for 3 years comes to about $12k of electricity).
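Back-of-the-envelope, assuming roughly 250-300 W per card under sustained load: 8 × 300 W ≈ 2.4 kW, × 24 h × 365 × 3 years ≈ 63,000 kWh, which at ~$0.15-0.20/kWh lands around $9.5k-12.5k, so the $12k figure is plausible (and that's before the rest of the server and any cooling).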
I'd throw a 7900xtx in an AM4 rig with 128gb of ddr4 (which is what I've been using for the past two years)
Fuck nvidia
You know, I haven't even been thinking about those AMD gpus for local llms and it is clearly a blind spot for me.
How is it? I'd guess a bunch of the MoE models actually run well?
1 reply →
Get a Radeon AI Pro r9700! 32GB of RAM
I'm glad it's not another LLM CLI that uses React. Vibe-cli seems to be built with https://github.com/textualize/textual/
I'm not excited that it's done in python. I've had experience with Aider struggling to display text as fast as the llm is spitting it out, though that was probably 6 months ago now.
That's an issue with Aider. Using a proper framework in the alternate terminal buffer would have greatly benefited them.
Python is more than capable of doing that. It’s not an issue of raw execution speed.
https://willmcgugan.github.io/streaming-markdown/
Just added it to our inventory. For those of you using Nix:
The repo is updated daily.
This is such a cool project. Thanks for sharing.
10x cheaper price per token than Claude, am I reading it right?
As long as it doesn't mean 10x worse performance, that's a good selling point.
Something like GPT-5 mini is a lot cheaper than even Haiku, but when I tried it, in my experience it was so bad it was a waste of time. But it's probably still more than 1/10 the performance of Haiku?
In work, where my employer pays for it, Haiku tends to be the workhorse with Sonnet or Opus when I see it flailing. On my own budget I’m a lot more cost conscious, so Haiku actually ends up being “the fancy model” and minimax m2 the “dumb model”.
Even if it is 10x cheaper and 2x worse it's going to eat up even more tokens spinning its wheels trying to implement things or squash bugs and you may end up spending more because of that. Or at least spending way more of your time.
The SWE-bench results place it at a comparable score to other open models and just a few points below the top-notch models, though.
Is it? The actual SOTA are not amazing at coding, so at least for me there is absolutely no reason to optimize on price at the moment. If I am going to use an LLM for coding it makes little sense to settle for a worse coder.
I dunno. Even pretty weak models can be decently performant, and 9/10 the performance for 1/10 the price means 10x the output, and for a lot of stuff that quality difference doesn't really matter. Considering even SOTA models are trash, slightly worse doesn't really make that much difference.
2 replies →
This is great! I just made an AUR package for it: https://aur.archlinux.org/packages/mistral-vibe
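Assuming you use an AUR helper like yay or paru, installing it should be as simple as:

    yay -S mistral-vibe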
Does anyone know where their SWE-bench Verified results are from? I can't find matching results on the leaderboards for their models or the Claude models and they don't provide any links.
Ah, finally! I was checking just a few days ago if they had a Claude Code-like tool as I would much rather give money to a European effort. I'll stop my Pro subscription at Anthropic and switch over and test it out.
I was briefly excited when Mistral Vibe launched and mentions "0 MCP Servers" in its startup screen... but I can't find how to configure any MCP servers. It doesn't respond to the /mcp command, and asking Devstral 2 for help, it thinks MCP is "Model Context Preservation". I'd really like to be able to run my local MCP tools that I wrote in Golang.
I'm team Anthropic with Claude Max & Claude Code, but I'm still excited to see Mistral trying this. Mistral has occasionally saved the day for me when Claude refused an innocuous request, and it's good to have alternatives... even if Mistral / Devstral seems to be far behind the quality of Claude.
Check this out: https://github.com/mistralai/mistral-vibe?tab=readme-ov-file...
Thank you! Finally got it working, had to comment out the mcp_servers line near the top of the config.toml file in ~/.vibe/, before adding my [[mcp_servers]] sections at the end of the file.
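For anyone else hitting this, a rough sketch of what the relevant part of ~/.vibe/config.toml ended up looking like for me (the field names inside [[mcp_servers]] are from memory and may not be exact; check the mistral-vibe README for the actual schema):

    # mcp_servers = []      # the default line near the top, commented out

    [[mcp_servers]]
    # field names below are illustrative; verify against the repo's docs
    name = "my-go-tools"                     # hypothetical local MCP server
    command = "/home/me/bin/my-mcp-server"   # path to the Go binary
    args = ["--stdio"]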
That was very helpful, thanks!
Very nice that there's a coding cli finally. I have a Mistral Pro account. I hope that it will be included. It's the main reason to have a Pro account tbh.
> Mistral Code is available with enterprise deployments.
> Contact our team to get started.
The competition is much smoother. Where are the subscriptions that would give users the coding agent and the chat for a flat fee, working out of the box?
Just tried it out via their free API and the Roo Code VSCode extension, and it's impressive. It walked through a data analytics and transformation problem (150,000 dataset entries) I had been debugging for the past 2 hours.
Open sourcing the TUI is pretty big news actually. Unless I missed something, I had to dig a bit to find it, but I think this is it: https://github.com/mistralai/mistral-vibe
Going to start hacking on this ASAP
Extremely happy with this release; the previous Devstral was great, but training it for open hands crippled its usefulness. Having their own CLI dev tool will hopefully be better.
Can you explain "training it for open hands"? I can't parse the meaning.
The original Devstral was a collaboration between All Hands AI (OpenHands) and Mistral [1]. You can use it with other agents, but you had to transfer over the prompt. Even then, the agents still didn't work that well. I tried it in RooCline and it worked extremely poorly with tool calls.
[1] https://openhands.dev/blog/devstral-a-new-state-of-the-art-o...
I'm so glad Mistral never sold out. We're really lucky to have them in the EU at a time when we're so focused on mil-tech etc.
I don't think it was ever an option, since it had ties with the French government early on (Cédric O) and Macron's party is quite pro-EU
They let so many important French companies down. So, yes, it could happen despite this beginning.
They’ll switch to military tech the second it becomes necessary, don’t kid yourself. I’m just glad we have a European alternative for the day the US decides to turn its back on us.
This tech is simply too critical to pretend the military won’t use it. That’s clearer now than ever, especially after the (so far flop-ish) launch of the U.S. military’s own genAI platform.
They have already:
- https://helsing.ai/newsroom/helsing-and-mistral-announce-str...
- https://sifted.eu/articles/mistral-helsing-defence-ai-action...
- Luxembourg army chose Mistral: https://www.forcesoperations.com/la-pepite-francaise-mistral...
- French army: https://www.defense.gouv.fr/actualites/ia-defense-sebastien-...
> I’m just glad we have a European alternative for the day the US decides to turn its back on us
Not sure you've kept up to date; the US has turned its back on most allies so far, including Europe and the EU, and now welcomes previous enemies with open arms.
1 reply →
It's not like there aren't already military AI startups in the EU. e.g. Helsing.
> I’m just glad we have a European alternative for the day the US decides to turn its back on us.
They did.
The system prompt and tool prompts for their open source (Apache 2 licensed) Python+Textual+Pydantic CLI tool are fun to read:
core/prompts/cli.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
core/prompts/compact.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/bash.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/grep.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/read_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/write_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/search_replace.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/todo.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
Based on your experience with Claude Code, how does Mistral Vibe compare?
I've not spent enough time with Mistral Vibe yet for a credible comparison, but given what I know about the underlying models (likely-1T-plus Opus 4.5 compared to the 123B Devstral 2) I'd be shocked if Vibe could out-perform Claude Code for the kinds of things I'm using it for.
Here's an example of the kinds of things I do with Claude Code now: https://gistpreview.github.io/?b64d5ee40439877eee7c224539452... - that one involved several from-scratch rewrites of the history of an entire Git repo just because I felt like it.
Off-topic, but it hurts my eyes: I dislike their font choice and the "cool look" of their graphics.
The only surprising and good part: everything, including the graphics, is fixed when I click my "speedreader" button in Brave. So they are doing that "cool look" with CSS.
Yeah, it's a bit gimmicky. You can hit `esc` and it will revert to the normal page design.
There's a scan-line effect they apply to everything that's "cool", but it gets old after a minute.
I gave it the job of modifying a fairly simple regex replacement and it took a while, over 5 minutes; Claude failed on the same prompt (which surprised me), and Codex did a similar job but faster. So all in all, not bad!
Finally, we can use a European model to replace Claude Code.
> Devstral 2 ships under a modified MIT license, while Devstral Small 2 uses Apache 2.0. Both are open-source and permissively licensed to accelerate distributed intelligence.
Uh, the "Modified MIT license" here[0] for Devstral 2 doesn't look particularly permissively licensed (or open-source):
> 2. You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. You may contact Mistral AI (sales@mistral.ai) to request a commercial license, which Mistral AI may grant you at its sole discretion, or choose to use the Model on Mistral AI's hosted services available at https://mistral.ai/.
[0] https://huggingface.co/mistralai/Devstral-2-123B-Instruct-25...
Personally I really like the normalization of these "Permissively" licensed models that only restrict companies with massive revenues from using them for free.
If you want to use something, and your company makes $240,000,000 in annual revenue, you should probably pay for it.
These are not permissively licensed though; the term "permissive license" has connotations that pretty much everyone who is into FLOSS understands (same with "open source").
I do not mind having a license like that, my gripe is with using the terms "permissive" and "open source" like that because such use dilutes them. I cannot think of any reason to do that aside from trying to dilute the term (especially when some laws, like the EU AI Act, are less restrictive when it comes to open source AIs specifically).
4 replies →
That's fine, but I don't think you should call it open source or call it MIT or even 'modified MIT.' Call it Mistral license or something along those lines
19 replies →
Earnestly, what's the concern here? People complain about open source being mostly beneficial to megacorps; if that's the main change (idk, I haven't looked too closely) then that's pretty good, no?
They are claiming something is open-source when it isn’t. Regardless of whether you think the deviation from open-source is a good thing or not, you should still be in favour of honesty.
13 replies →
Mainly about the dilution of the term. Though TBH I do not think that open source is beneficial mostly to megacorps either.
Mistral has used janky licenses like that a few times in the past. I was hoping the competition from China might have snapped them out of it.
All "Open Source" licenses are to an extent, janky. Obligatory "Stallman was right;" -- If it's not GPL/Free Software, YMMV.
Is such a term even enforceable? How would it be? How could Mistral know how much a company makes if that information isn't public?
They don't have to enforce it, evil megacorps won't risk the legal consequences of using it without talking to Mistral first. In reality they just won't use it.
Somehow it writes bad React code and fails to follow the linting instructions in the prompt half the time. But surprisingly, the Python coding was great!
> Model Size (B tokens)
How is that a measure of model size? It should either be parameter size, activated parameters, or cost per output token.
Looks like a typo because the models line up with reported param sizes.
Will definitely try Mistral Vibe with gpt-oss-20b
Let's see which company becomes the first to sell "coding appliances": hardware with a model good enough for normal coding.
If Mistral is so permissive they could be the first ones, provided that hardware is then fast/cheap/efficient enough to create a small box that can be placed in an office.
Maybe in 5 years.
My MacBook Pro with an M4 Pro chip can handle a number of these models (I think it has 16GB of VRAM) with reasonable performance; my bottleneck is consistently the token caps. I assume someone with a much more powerful Mac Studio could run way more than I can, considering they get access to about 96GB of VRAM out of the system RAM iirc.
I bought a framework desktop hoping to do this.
And it can do it, right? I think the AMD AI Max line is the first realistic offering for this type of thing.
The Apple offerings are interesting, but the lack of x86, Linux, and general compatibility makes them a hard sell imo.
my bet is a deepseek box
llm in a box connected via usb is the dream.
...so it won't ever happen, it'll require wifi and will only be accessible via the cloud, and you'll have to pay a subscription fee to access the hardware you bought. obviously.
PSA: 10X cheaper is not actually cheaper (or faster) when you have to prompt it 10 times to get the correct solution.
I am very disappointed they don't have a coding subscription equivalent to the 200 EUR ChatGPT or Claude ones, and it is only available for Enterprise deployments.
The only thing I found is a pay-as-you-go API, but I wonder if it is any good (and cost-effective) vs Claude et al.
> Devstral 2 is currently offered free via our API. After the free period, the API pricing will be $0.40/$2.00 per million tokens (input/output) for Devstral 2
With pricing this low I don't see any reason why someone would buy a 200 EUR sub. These days those subs are much more limited in Claude Code or Cursor than they used to be (or used to be unlimited). Better to pay as you go, especially when there are days when you probably use AI less or not at all (weekends/holidays etc.), as long as those credits don't expire.
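For a rough sense of scale (token counts here are purely illustrative): at $0.40/M input and $2.00/M output, a fairly heavy day of, say, 5M input and 0.5M output tokens comes to 5 × $0.40 + 0.5 × $2.00 = $3.00, i.e. roughly $90 for a full month of that, still well under a 200 EUR subscription.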
True, I just wish I could pay once for code AND the chat, but the chat subscription does not include Code sadly.
At these rates you can afford to pay by the token.
In a figure: Model size (B tokens)?
Did anyone test how up to date its knowledge is?
After querying the model about .NET, it seems that its knowledge comes from around June 2024.
I confirm that. It had no idea how to use Deno v2+.
Looks like another DeepSeek distill, like the new Ministrals. For every other use case that would be an insult, but for coding that's a great approach, given how much of a lead in coding performance Qwen and DeepSeek have on Mistral's internal datasets. The Small 24B seems to have a decent edge on 30B-A3B, though it'll be comparatively extremely slow to run.
Can Vibe CLI help me vibe code PRs for when I vibe on the https://github.com/buttplugio/buttplug repo?
You can do anything if you believe.
Yet another CLI.
Why does every AI provider need to have its own tool, instead of contributing to existing tools like Roo Code or Opencode?
My 2ct: Because providers want to make their model run optimally and maybe some of them try to build a moat.
> providers want to make their model run optimally
Because they couldn't do it by contributing to existing opensource tools?
Modified MIT?????
Just call it Mistral License & flush it down