https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...
Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)
We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
18 replies →
It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.
So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.
So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
1 reply →
I assume all of the models also have variations on, “how many ‘r’s in strawberry”.
> We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...
10 replies →
If this had any substance then it could be criticized, which is what they're trying to avoid.
1 reply →
but can it recreate the spacejam 1996 website? https://www.spacejam.com/1996/jam.html
in case folks are missing the context
https://news.ycombinator.com/item?id=46183294
That is not a meaningful metric given that we don't live in 1996 and neither do our web standards.
10 replies →
I think this benchmark could be slightly misleading for assessing a coding model. But still a very good result.
Yes, SVG is code, but not in the sense of being executable with verifiable inputs and outputs.
I love that we are earnestly contemplating the merits of the pelican benchmark. What a timeline.
1 reply →
But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
Where did you get the llm tool from?!
He made it: https://github.com/simonw/llm
2 replies →
Skipped the bicycle entirely and upgraded to a sweet motorcycle :)
Looks like a Cybertruck actually!
1 reply →
The Batman motorcycle!
2 replies →
Is it really an svg if it’s just embedded base64 of a jpg
You were seeing the base64 image tag output at the bottom. The SVG input is at the top.
Impressive! I'm really excited to leverage this in my gooning sessions!
Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku 4.5 and Gemini 3 Pro Fast (TBA) and whatever ridiculously-named light model OpenAI offers today (GPT 5.1 Codex Max Extra High Fast?)
The OpenAI thing is named Garlic.
(Surely they won't release it like that, right..?)
TIL: https://garlicmodel.com/
That looks like the next flagship rather than the fast distillation, but thanks for sharing.
6 replies →
No, this is comparable to Deepseek-v3.2 even on their highlight task, with significantly worse general ability. And it's priced at 5x that.
It's open source; the price is up to the provider, and I do not see any on openrouter yet. ~~Given that devstral is much smaller, I can not imagine it will be more expensive, let alone 5x. If anything DeepSeek will be 5x the cost.~~
edit: Mea culpa. I missed the active vs dense difference.
5 replies →
I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them.
It spent about half an hour, correctly identified what the program did, found two small bugs, fixed them, made some minor improvements, and added two new, small but nice features.
It introduced one new bug, but then fixed it on the first try when I pointed it out.
The changes it made to the code were minimal and localized; unlike some more "creative" models, it didn't randomly rewrite stuff it didn't have to.
It's too early to form a conclusion, but so far, it's looking quite competent.
On what hardware did you run it?
FWIW, it’s free through Mistral right now
1 reply →
So I tested the bigger model with my typical standard test queries, which are neither too tough nor too easy. They are also ones you wouldn't find extensive training data for. Finally, I have already used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3.
Here is what I think about the bigger model: It sits between sonnet 4 and sonnet 4.5. Something like "sonnet 4.3". The response speed was pretty good.
Overall, I can see myself shifting to this for regular day-to-day coding if they can offer it at competitive pricing.
I'll still use sonnet 4.5 or gemini 3 for complex queries, but, for everything else code related, this seems to be pretty good.
Congrats Mistral. You have most probably caught up to the big guys. Not quite there yet, but not far now.
Looks interesting, eager to play around with it! Devstral was a neat model when it was released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so it's gonna be interesting to see if Devstral 2 can replace it.
I'm a bit saddened by the name of the CLI tool, which to me implies the intended usage. "Vibe-coding" is a fun exercise for seeing where models go wrong, but for professional work where you need tight control over quality, you obviously can't vibe your way to excellence; hard reviews are required. So: not "vibe coding", which is all about unreviewed code and just going with whatever the LLM outputs.
But regardless of that, it seems like everyone and their mother is aiming to fuel the vibe-coding frenzy. But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it? All the agents seem to focus on handing work off to vibe-coding agents, while what I want is something even more tightly integrated with my tools so I can continue delivering high-quality code I know and control. Where are those tools? None of the existing coding agents apparently aim for this...
Their new CLI agent tool [1] is written in Python unlike similar agents from Anthropic/Google (Typescript/Bun) and OpenAI (Rust). It also appears to have first class ACP support, where ACP is the new protocol from Zed [2].
[1] https://github.com/mistralai/mistral-vibe
[2] https://zed.dev/acp
I did not know A2A had a competitor :(
1 reply →
> Their new CLI agent tool [1] is written in
This is exactly the CLI I'm referring to, whose name implies it's for playing around with "vibe-coding", instead of helping professional developers produce high quality code. It's the opposite of what I and many others are looking for.
1 reply →
>vibe-coding
A surprising amount of programming is building cardboard services or apps that only need to last six months to a year and are then thrown away when temporary business needs change. Execs are constantly clamoring for semi-persistent dashboards and ETL-visualized data that lasts just long enough to rein in the problem and move on to the next fire. Agentic coding is good enough for cardboard services that collapse when they get wet. I wouldn't build an industrial data lake service with it, but you can certainly build cardboard consumers of the data lake.
You are right.
But there is nothing more permanent than a quickly hacked-together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.
> But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it?
Claude Code not good enough for ya?
Claude Code has absolutely zero features that help me review code or do anything other than vibe-code and accept changes as they come in. We need diff comparisons between different executions, a TUI tailored for that kind of work, and more. Claude Code is basically an MVP of that.
Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.
8 replies →
> where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs?
This is what we're building at Brokk: https://brokk.ai/
Quick intro: https://blog.brokk.ai/introducing-lutz-mode/
Did you try Aider?
I did, although a long time ago, so maybe I need to try it again. But it still seems to be stuck in a chat-like interface instead of something tailored to software development. Think IDE but better.
14 replies →
I created a very unprofessional tool, which apparently does what you want!
While True:
0. Context injected automatically. (My repos are small.)
1. I describe a change.
2. LLM proposes a code edit. (Can edit multiple files simultaneously. Only one LLM call required :)
3. I accept/reject the edit.
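In case it helps anyone picture it, here's a minimal Python sketch of that loop (gather_context, propose_edit and apply_edit are hypothetical placeholders for my actual context-injection, LLM-call and patching code):

    def gather_context() -> str: ...                   # 0. inject repo context automatically
    def propose_edit(ctx: str, req: str) -> str: ...   # 2. one LLM call proposes a (multi-file) diff
    def apply_edit(diff: str) -> None: ...             # apply an accepted diff

    while True:
        request = input("Describe a change (empty to quit): ")  # 1. I describe a change
        if not request:
            break
        diff = propose_edit(gather_context(), request)
        print(diff)
        if input("Apply this edit? [y/N] ").lower() == "y":      # 3. accept/reject the edit
            apply_edit(diff)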
High quality code is a thing of the past
What matters is high quality specifications including test cases
> High quality code is a thing of the past
Says the person who will find themselves unable to change the software in even the slightest way without having to do large refactors across everything at the same time.
High quality code matters more than ever, would be my argument. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem is the second you invite it to keep doing that.
1 reply →
"high quality specifications" have _always_ been a thing that matters.
In my mind, it's somewhat orthogonal to code quality.
Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile makes specs and code quality somewhat related, but in at least some ways it probably drives lower-quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.
> run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this
What kind of hardware do you have to be able to run a performant GPT-OSS-120b locally?
RTX Pro 6000, ends up taking ~66GB when running the MXFP4 native quant with llama-server/llama.cpp and max context, as an example. Guess you could do it with two 5090s with slightly less context, or different software aimed at memory usage efficiency.
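For anyone curious, a rough example of the kind of invocation I mean (the gguf filename is illustrative and the context size is the model's 131072 maximum; adjust to your setup):

    llama-server -m gpt-oss-120b-mxfp4.gguf -c 131072 -ngl 99 --port 8080

-c sets the context window, -ngl 99 offloads all layers to the GPU.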
1 reply →
The model is 64GB (int4 native), add 20GB or so for context.
There are many platforms out there that can run it decently.
AMD Strix Halo, Mac platforms, two (or three, without extra RAM) of the new AMD AI Pro R9700 (32GB of VRAM, $1200), multi-consumer-GPU setups, etc.
Mbp 128gb.
I'm sure I'm not the only one that thinks "Vibe CLI" sounds like an unserious tool. I use Claude Code a lot and little of it is what I would consider Vibe Coding.
They're looking for free publicity. "This French company launched a tool that lets you 'vibe' an application into being. Programmers outraged!"
Using LLM's to write code is inherently best for unserious work.
These are the cutting insights I come to HN for.
2 replies →
"Not reviewing generated code" is the problem. Not the LLM generated code.
Agree, but that's just the term for any LLM-assisted development now.
Even the Gemini 3 announcement page had some bit like "best model for vibe coding".
Maybe they are just trying to be funny.
Their chat was called "Le Chat" - it's just their style.
And while it may miss the HN crowd, one of the main selling-points of AI coding is the ease and playfulness.
If you’re letting Claude write code you’re vibe coding
So people have different definitions of the word, but originally Vibe Coding meant "don't even look at the code".
If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)
There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).
4 replies →
The original definition was very different. The main thing with vibe coding is that you don't care about the code. You don't even look at the code. You prompt, test that you got what you wanted, and move on. You can absolutely use cc to vibe code. But you can also use it to ... code based on prompts. Or specs. Or docs. Or whatever else. The difference is if you want / care to look at the code or not.
No, that's not the definition of "vibe coding". Vibe coding is letting the model do whatever without reviewing it and not understanding the architecture. This was the original definition and still is.
It sure doesn't feel like it, given how closely I have to babysit Claude Code; left to its own devices for a minute, it produces code I don't recognize once it's done.
1 reply →
Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?
All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.
I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system. I mean, if I was doing more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.
For grins:
Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in at $5,000 today and, given RAM prices, maybe not actually possible tomorrow.
Max CUDA compatibility, slower t/s? DGX Spark.
Ok with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128GB unified memory; order a Framework Desktop.
Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth, and Mac users seem to be quite happy running locally for just messing around.
You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.
> I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system.
That's a good idea!
Curious about this, if you don't mind sharing:
- what's the stack ? (Do you run like llama.cpp on that rented machine?)
- what model(s) do you run there?
- what's your rough monthly cost? (Does it come up much cheaper than if you called the equivalent paid APIs)
5 replies →
Dual 3090s (24GB each) on 8x+8x PCIe has been a really reliable setup for me (with an NVLink bridge... even though it's relatively low bandwidth compared to Tesla NVLink, it's better than going over PCIe!)
48GB of VRAM and lots of CUDA cores; hard to beat this value atm.
If you want to go even further, you can get an 8x V100 32GB server complete with 512GB of RAM and NVLink switching for $7000 USD from unixsurplus (ebay.com/itm/146589457908), which can run even bigger models and with healthy throughput. You would need 240V power to run that in a home lab environment though.
The V100 is outdated (no bf16, dropped in CUDA 13) and power hungry (8 cards running continuously for 3 years comes to about $12k of electricity).
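Back-of-the-envelope, assuming roughly 250-300 W per card under sustained load: 8 × 300 W ≈ 2.4 kW, × 24 h × 365 × 3 years ≈ 63,000 kWh, which at ~$0.15-0.20/kWh lands around $9.5k-12.5k, so the $12k figure is plausible (and that's before the rest of the server and any cooling).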
I'd throw a 7900xtx in an AM4 rig with 128gb of ddr4 (which is what I've been using for the past two years)
Fuck nvidia
You know, I haven't even been thinking about those AMD gpus for local llms and it is clearly a blind spot for me.
How is it? I'd guess a bunch of the MoE models actually run well?
1 reply →
Get a Radeon AI Pro r9700! 32GB of RAM
I'm glad it's not another LLM CLI that uses React. Vibe-cli seems to be built with https://github.com/textualize/textual/
I'm not excited that it's done in python. I've had experience with Aider struggling to display text as fast as the llm is spitting it out, though that was probably 6 months ago now.
That's an issue with Aider. Using a proper framework in the alternate terminal buffer would have greatly benefited them.
Python is more than capable of doing that. It’s not an issue of raw execution speed.
https://willmcgugan.github.io/streaming-markdown/
Just added it to our inventory. For those of you using Nix:
The repo is updated daily.
This is such a cool project. Thanks for sharing.
10x cheaper price per token than Claude, am I reading it right?
As long as it doesn't mean 10x worse performance, that's a good selling point.
Something like GPT-5 mini is a lot cheaper than even Haiku, but when I tried it, in my experience it was so bad it was a waste of time. But it's probably still more than 1/10 the performance of Haiku?
In work, where my employer pays for it, Haiku tends to be the workhorse with Sonnet or Opus when I see it flailing. On my own budget I’m a lot more cost conscious, so Haiku actually ends up being “the fancy model” and minimax m2 the “dumb model”.
Even if it is 10x cheaper and 2x worse it's going to eat up even more tokens spinning its wheels trying to implement things or squash bugs and you may end up spending more because of that. Or at least spending way more of your time.
The SWE-bench results place it at a comparable score to other open models and just a few points below the top-notch models, though.
Is it? The actual SOTA are not amazing at coding, so at least for me there is absolutely no reason to optimize on price at the moment. If I am going to use an LLM for coding it makes little sense to settle for a worse coder.
I dunno. Even pretty weak models can be decently performant, and 9/10 the performance for 1/10 the price means 10x the output, and for a lot of stuff that quality difference doesn't really matter. Considering even SOTA models are trash, slightly worse doesn't really make that much difference.
2 replies →
This is great! I just made an AUR package for it: https://aur.archlinux.org/packages/mistral-vibe
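Assuming you use an AUR helper like yay or paru, installing it should be as simple as:

    yay -S mistral-vibe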
Does anyone know where their SWE-bench Verified results are from? I can't find matching results on the leaderboards for their models or the Claude models and they don't provide any links.
Ah, finally! I was checking just a few days ago if they had a Claude Code-like tool as I would much rather give money to a European effort. I'll stop my Pro subscription at Anthropic and switch over and test it out.
I was briefly excited when Mistral Vibe launched and mentions "0 MCP Servers" in its startup screen... but I can't find how to configure any MCP servers. It doesn't respond to the /mcp command, and asking Devstral 2 for help, it thinks MCP is "Model Context Preservation". I'd really like to be able to run my local MCP tools that I wrote in Golang.
I'm team Anthropic with Claude Max & Claude Code, but I'm still excited to see Mistral trying this. Mistral has occasionally saved the day for me when Claude refused an innocuous request, and it's good to have alternatives... even if Mistral / Devstral seems to be far behind the quality of Claude.
Check this out: https://github.com/mistralai/mistral-vibe?tab=readme-ov-file...
Thank you! Finally got it working, had to comment out the mcp_servers line near the top of the config.toml file in ~/.vibe/, before adding my [[mcp_servers]] sections at the end of the file.
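For anyone else hitting this, a rough sketch of what the relevant part of ~/.vibe/config.toml ended up looking like for me (the field names inside [[mcp_servers]] are from memory and may not be exact; check the mistral-vibe README for the actual schema):

    # mcp_servers = []      # the default line near the top, commented out

    [[mcp_servers]]
    # field names below are illustrative; verify against the repo's docs
    name = "my-go-tools"                     # hypothetical local MCP server
    command = "/home/me/bin/my-mcp-server"   # path to the Go binary
    args = ["--stdio"]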
That was very helpful, thanks!
Very nice that there's a coding cli finally. I have a Mistral Pro account. I hope that it will be included. It's the main reason to have a Pro account tbh.
> Mistral Code is available with enterprise deployments.
> Contact our team to get started.
The competition is much smoother. Where are the subscriptions that would give users the coding agent and the chat for a flat fee, working out of the box?
Just tried it out via their free API and the Roo Code VSCode extension, and it's impressive. It walked through a data analytics and transformation problem (150,000 dataset entries) I had been debugging for the past 2 hours.
Open sourcing the TUI is pretty big news actually. Unless I missed something, I had to dig a bit to find it, but I think this is it: https://github.com/mistralai/mistral-vibe
Going to start hacking on this ASAP
Extremely happy with this release; the previous Devstral was great, but training it for open hands crippled its usefulness. Having their own CLI dev tool will hopefully be better.
Can you explain "training it for open hands"? I can't parse the meaning.
The original Devstral was a collaboration between All Hands AI (OpenHands) and Mistral [1]. You can use it with other agents, but you had to transfer over the prompt. Even then, the agents still didn't work that well. I tried it in RooCline and it worked extremely poorly with tool calls.
[1] https://openhands.dev/blog/devstral-a-new-state-of-the-art-o...
I'm so glad Mistral never sold out. We're really lucky to have them in the EU at a time when we're so focused on mil-tech etc.
I don't think it was ever an option, since it had ties with the French government early on (Cédric O) and Macron's party is quite pro-EU
They let so many important French companies down. So, yes, it could happen despite this beginning.
They’ll switch to military tech the second it becomes necessary, don’t kid yourself. I’m just glad we have a European alternative for the day the US decides to turn its back on us.
This tech is simply too critical to pretend the military won’t use it. That’s clearer now than ever, especially after the (so far flop-ish) launch of the U.S. military’s own genAI platform.
They have already:
- https://helsing.ai/newsroom/helsing-and-mistral-announce-str...
- https://sifted.eu/articles/mistral-helsing-defence-ai-action...
- Luxembourg army chose Mistral: https://www.forcesoperations.com/la-pepite-francaise-mistral...
- French army: https://www.defense.gouv.fr/actualites/ia-defense-sebastien-...
> I’m just glad we have a European alternative for the day the US decides to turn its back on us
Not sure you've kept up to date; the US has turned its back on most allies so far, including Europe and the EU, and now welcomes previous enemies with open arms.
1 reply →
It's not like there aren't already military AI startups in the EU. e.g. Helsing.
> I’m just glad we have a European alternative for the day the US decides to turn its back on us.
They did.
The system prompt and tool prompts for their open source (Apache 2 licensed) Python+Textual+Pydantic CLI tool are fun to read:
core/prompts/cli.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
core/prompts/compact.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/bash.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/grep.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/read_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/write_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/search_replace.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/todo.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
Based on your experience with Claude Code, how does Mistral Vibe compare?
I've not spent enough time with Mistral Vibe yet for a credible comparison, but given what I know about the underlying models (likely-1T-plus Opus 4.5 compared to the 123B Devstral 2) I'd be shocked if Vibe could out-perform Claude Code for the kinds of things I'm using it for.
Here's an example of the kinds of things I do with Claude Code now: https://gistpreview.github.io/?b64d5ee40439877eee7c224539452... - that one involved several from-scratch rewrites of the history of an entire Git repo just because I felt like it.
Off-topic, but it hurts my eyes: I dislike their font choice and the "cool look" of their graphics.
The only surprising and good part: everything, including the graphics, is fixed when I click my "speedreader" button in Brave. So they are doing that "cool look" with CSS.
Yeah, it's a bit gimmicky. You can hit `esc` and it will revert to the normal page design.
There's a scan-line effect they apply to everything that's "cool", but it gets old after a minute.
I gave it the job of modifying a fairly simple regex replacement and it took a while, over 5 minutes; Claude failed on the same prompt (which surprised me), and Codex did a similar job but faster. So all in all, not bad!
Finally, we can use a European model to replace Claude Code.
> Devstral 2 ships under a modified MIT license, while Devstral Small 2 uses Apache 2.0. Both are open-source and permissively licensed to accelerate distributed intelligence.
Uh, the "Modified MIT license" here[0] for Devstral 2 doesn't look particularly permissively licensed (or open-source):
> 2. You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. You may contact Mistral AI (sales@mistral.ai) to request a commercial license, which Mistral AI may grant you at its sole discretion, or choose to use the Model on Mistral AI's hosted services available at https://mistral.ai/.
[0] https://huggingface.co/mistralai/Devstral-2-123B-Instruct-25...
Personally I really like the normalization of these "Permissively" licensed models that only restrict companies with massive revenues from using them for free.
If you want to use something, and your company makes $240,000,000 in annual revenue, you should probably pay for it.
These are not permissively licensed though; the term "permissive license" has connotations that pretty much everyone who is into FLOSS understands (same with "open source").
I do not mind having a license like that, my gripe is with using the terms "permissive" and "open source" like that because such use dilutes them. I cannot think of any reason to do that aside from trying to dilute the term (especially when some laws, like the EU AI Act, are less restrictive when it comes to open source AIs specifically).
4 replies →
That's fine, but I don't think you should call it open source or call it MIT or even 'modified MIT.' Call it Mistral license or something along those lines
19 replies →
Earnestly, what's the concern here? People complain about open source being mostly beneficial to megacorps; if that's the main change (idk, I haven't looked too closely) then that's pretty good, no?
They are claiming something is open-source when it isn’t. Regardless of whether you think the deviation from open-source is a good thing or not, you should still be in favour of honesty.
13 replies →
Mainly about the dilution of the term. Though TBH I do not think that open source is beneficial mostly to megacorps either.
Mistral has used janky licenses like that a few times in the past. I was hoping the competition from China might have snapped them out of it.
All "Open Source" licenses are to an extent, janky. Obligatory "Stallman was right;" -- If it's not GPL/Free Software, YMMV.
Is such a term even enforceable? How would it be? How could Mistral know how much a company makes if that information isn't public?
They don't have to enforce it, evil megacorps won't risk the legal consequences of using it without talking to Mistral first. In reality they just won't use it.
Somehow it writes bad React code and fails to follow the linting instructions in the prompt half the time. But surprisingly, the Python coding was great!
> Model Size (B tokens)
How is that a measure of model size? It should either be parameter size, activated parameters, or cost per output token.
Looks like a typo because the models line up with reported param sizes.
Will definitely try Mistral Vibe with gpt-oss-20b
Let's see which company becomes the first to sell "coding appliances": hardware with a model good enough for normal coding.
If Mistral is so permissive they could be the first ones, provided that hardware is then fast/cheap/efficient enough to create a small box that can be placed in an office.
Maybe in 5 years.
My MacBook Pro with an M4 Pro chip can handle a number of these models (I think it has 16GB of VRAM) with reasonable performance; my bottleneck is consistently the token caps. I assume someone with a much more powerful Mac Studio could run way more than I can, considering they get access to about 96GB of VRAM out of the system RAM iirc.
I bought a framework desktop hoping to do this.
And it can do it, right? I think the AMD AI Max line is the first realistic offering for this type of thing.
The Apple offerings are interesting, but the lack of x86, Linux, and general compatibility makes them a hard sell imo.
my bet is a deepseek box
llm in a box connected via usb is the dream.
...so it won't ever happen, it'll require wifi and will only be accessible via the cloud, and you'll have to pay a subscription fee to access the hardware you bought. obviously.
PSA: 10X cheaper is not actually cheaper (or faster) when you have to prompt it 10 times to get the correct solution.
I am very disappointed they don't have a coding subscription equivalent to the 200 EUR ChatGPT or Claude ones, and it is only available for Enterprise deployments.
The only thing I found is a pay-as-you-go API, but I wonder if it is any good (and cost-effective) vs Claude et al.
> Devstral 2 is currently offered free via our API. After the free period, the API pricing will be $0.40/$2.00 per million tokens (input/output) for Devstral 2
With pricing this low I don't see any reason why someone would buy a 200 EUR sub. These days those subs are much more limited in Claude Code or Cursor than they used to be (or used to be unlimited). Better to pay as you go, especially when there are days when you probably use AI less or not at all (weekends/holidays etc.), as long as those credits don't expire.
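For a rough sense of scale (token counts here are purely illustrative): at $0.40/M input and $2.00/M output, a fairly heavy day of, say, 5M input and 0.5M output tokens comes to 5 × $0.40 + 0.5 × $2.00 = $3.00, i.e. roughly $90 for a full month of that, still well under a 200 EUR subscription.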
True, I just wish I could pay once for code AND the chat, but the chat subscription does not include Code sadly.
At these rates you can afford to pay by the token.
In a figure: Model size (B tokens)?
Did anyone test how up to date its knowledge is?
After querying the model about .NET, it seems that its knowledge comes from around June 2024.
I confirm that. It had no idea how to use Deno v2+.
Looks like another DeepSeek distill, like the new Ministrals. For every other use case that would be an insult, but for coding that's a great approach, given how much of a lead in coding performance Qwen and DeepSeek have on Mistral's internal datasets. The Small 24B seems to have a decent edge on 30B-A3B, though it'll be comparatively extremely slow to run.
Can Vibe CLI help me vibe code PRs for when I vibe on the https://github.com/buttplugio/buttplug repo?
You can do anything if you believe.
Yet another CLI.
Why does every AI provider need to have its own tool, instead of contributing to existing tools like Roo Code or Opencode?
My 2ct: Because providers want to make their model run optimally and maybe some of them try to build a moat.
> providers want to make their model run optimally
Because they couldn't do it by contributing to existing opensource tools?
Modified MIT?????
Just call it Mistral License & flush it down