Mistral 3 family of models released

2 months ago (mistral.ai)

I use large language models in http://phrasing.app to format the data I retrieve in a consistent, skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and follows formatting instructions to the letter. I was (and still am) super super impressed. Even if it doesn't hold up in benchmarks, it still outperforms in practice.

I'm not sure how these new models compare to the biggest and baddest models, but if price, speed, and reliability are a concern for your use cases I cannot recommend Mistral enough.

Very excited to try out these new models! To be fair, mistral-3-medium-0525 still occasionally produces gibberish in ~0.1% of my use cases (vs gpt-5's 15% failure rate). Will report back if that goes up or down with these new models.

  • Some time ago I canceled all my paid subscriptions to chatbots because they are interchangeable so I just rotate between Grok, ChatGPT, Gemini, Deepseek and Mistral.

    On the API side of things my experience is that the model behaving as expected is the greatest feature.

    There I also switched to Openrouter instead of paying directly so I can use whatever model fits best.

    The recent buzz about ad-based chatbot services is probably because the companies no longer have an edge despite what the benchmarks say; users are noticing it and cancelling paid plans. Just today OpenAI offered me a 1-month free trial, as if I wasn't using it two months ago. I guess they hope I forget to cancel.

    • Yep I spent 3 days optimizing my prompt trying to get gpt-5 to work. Tried a bunch of different models (some Azure some OpenRouter) and got a better success rate with several others without any tailoring of the prompt.

      Was really plug and play. There are still small nuances to each one, but compared to a year ago prompts are much more portable

      2 replies →

    • > because they are interchangeable

      What is your use-case?

      Mine is: I use "Pro"/"Max"/"DeepThink" models to iterate on novel cross-domain applications of existing mathematics.

      My interaction is: I craft a detailed prompt in my editor, hand it off, come back 20-30 minutes later, review the reply, and then repeat if necessary.

      My experience is that they're all very, very different from one another.

      1 reply →

  • This is my experience as well. Mistral models may not be the best according to benchmarks and I don't use them for personal chats or coding, but for simple tasks with pre-defined scope (such as categorization, summarization, etc.) they are the option I choose. I use mistral-small with batch API and it's probably the best cost-efficient option out there.

  • It makes me wonder about the gaps in evaluating LLMs by benchmarks. There is almost certainly overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models; that seems to have been fruitful for improving coding.

    • The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases, but if you have something specific you're optimizing for, there's probably a more obscure model that just does a better job (a rough sketch of what such a harness can look like is below).

      8 replies →
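
      A minimal sketch of that kind of use-case-specific benchmark, assuming an OpenAI-compatible gateway such as OpenRouter; the model ids, prompts, and `check_format` validator here are placeholders, not anyone's real setup:

      ```
      # Run the same prompts through a few candidate models and score how often
      # the output passes your own validator. Everything below is illustrative.
      import os
      import re

      from openai import OpenAI

      client = OpenAI(
          base_url="https://openrouter.ai/api/v1",       # OpenAI-compatible gateway
          api_key=os.environ["OPENROUTER_API_KEY"],
      )

      CANDIDATES = ["mistralai/mistral-medium-3", "openai/gpt-5"]    # placeholder ids
      PROMPTS = ["Format the word 'bonjour' as **bold** markdown."]  # your real test set goes here

      def check_format(text: str) -> bool:
          """Use-case-specific validator; here: output must contain bold markdown."""
          return bool(re.search(r"\*\*[^*]+\*\*", text))

      for model in CANDIDATES:
          passed = 0
          for prompt in PROMPTS:
              resp = client.chat.completions.create(
                  model=model, messages=[{"role": "user", "content": prompt}]
              )
              passed += check_format(resp.choices[0].message.content or "")
          print(f"{model}: {passed}/{len(PROMPTS)} passed")
      ```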

    • I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

      The only exception I can think of is models trained on synthetic data like Phi.

    • If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good).

      Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)

      1 reply →

  • Thanks for sharing your use case of the Mistral models, which are indeed top-notch! I had a look at phrasing.app, and while it is a nice website, I found the copy of "Hand-crafted. Phrasing was designed & developed by humans, for humans." somewhat of a false virtue given your statements here of advanced LLM usage.

    • I don't see the contradiction. I do not use LLMs in the design, development, copywriting, marketing, blogging, or any other aspect of the crafting of the application.

      I labor over every word, every button, every line of code, every blog post. I would say it is as hand-crafted as something digital can be.

      3 replies →

  • Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral gibberish production rate to gpt-5.1's complex task failure rate?

    Does Mistral even have a Tool Use model? That would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.

    • Yes. I spent about 3 days trying to optimize the prompt to get gpt-5 to not produce gibberish, to no avail. Completions took several minutes, had an above 50% timeout rate (with a 6 minute timeout mind you), and after retrying they still would return gibberish about 15% of the time (12% on one task, 20% on another task).

      I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.

      Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.

      This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.

      I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD

      7 replies →

  • I have a need to remove loose "signature" lines from the last 10% of a tremendous e-mail dataset. Based on your experience, how do you think mistral-3-medium-0525 would do?

    • What's your acceptable error rate? Honestly ministral would probably be sufficient if you can tolerate a small failure rate. I feel like medium would be overkill.

      But I'm no expert. I can't say I've used Mistral much outside of my own domain. A rough sketch of how I'd prototype it is below.

      2 replies →
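
      A minimal prototype of that kind of signature stripping, assuming the official `mistralai` Python SDK; the model id and the prompt wording are assumptions, not a recommendation from the thread:

      ```
      # Ask a small model to return the e-mail body with trailing signature lines
      # removed, then spot-check a handful of samples before running the full dataset.
      import os

      from mistralai import Mistral

      client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

      PROMPT = (
          "Remove any signature lines (sign-offs, names, phone numbers, disclaimers) "
          "from the end of this e-mail. Return only the cleaned body, otherwise verbatim.\n\n{email}"
      )

      def strip_signature(email_body: str, model: str = "ministral-8b-latest") -> str:
          # model id is an assumption; swap in whatever the API actually lists
          resp = client.chat.complete(
              model=model,
              messages=[{"role": "user", "content": PROMPT.format(email=email_body)}],
          )
          return resp.choices[0].message.content

      sample = "Hi team,\n\nThe report is attached.\n\nBest,\nJane Doe\n555-0100"
      print(strip_signature(sample))
      ```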

The new large model uses DeepseekV2 architecture. 0 mention on the page lol.

It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3".

---

vllm/model_executor/models/mistral_large_3.py

```
from vllm.model_executor.models.deepseek_v2 import DeepseekV3ForCausalLM


class MistralLarge3ForCausalLM(DeepseekV3ForCausalLM):
    pass  # class body not shown in the quoted snippet
```

"Science has always thrived on openness and shared discovery." btw

Okay I'll stop being snarky now and try the 14B model at home. Vision is good additional functionality on Large.

  • So they spent all of their R&D to copy deepseek, leaving none for the singular novel added feature: vision.

    To quote the hf page:

    >Behind vision-first models in multimodal tasks: Mistral Large 3 can lag behind models optimized for vision tasks and use cases.

    • Well, behind "models", not "language models".

      Of course models made purely for image tasks will completely outclass it. The vision-language models are useful for their generalist capabilities.

  • Architecture differences relative to vanilla transformers, and between modern transformers, are a tiny part of what makes a model nowadays.

  • I don't think it's fair to demand everything be open and then get mad when that openness is used. It's an obsessive and harmful double standard.

The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU

Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/

Europe's bright star has been quiet for a while; great to see them back, and good to see them return to the open-source light with Apache 2.0 licenses. They're too far behind the SOTA pack for exclusive/proprietary models to work in their favor.

Mistral had the best small models on consumer GPUs for a while, hopefully Ministral 14B lives up to their benchmarks.

Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.

  • I guess that could be considered comparative advertising then and companies generally try to avoid that scrutiny.

  • The lack of the comparison (which absolutely was done) tells you exactly what you need to know.

    • I think people from the US often aren't aware of how many companies from the EU simply won't risk losing their data to the providers you have in mind: OpenAI, Anthropic and Google. They simply aren't an option at all.

      The company I work for, for example, a mid-sized tech business, is currently investigating its local hosting options for LLMs. So Mistral will certainly be an option, alongside the Qwen family and DeepSeek.

      Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.

      13 replies →

    • They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.

      There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.

      A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.

      2 replies →

    • If someone is using these models, they probably can't or won't use the existing SOTA models, so not sure how useful those comparisons actually are. "Here is a benchmark that makes us look bad from a model you can't use on a task you won't be undertaking" isn't actually helpful (and definitely not in a press release).

      4 replies →

    • Here's what I understood from the blog post:

      - Mistral Large 3 is comparable with the previous Deepseek release.

      - Ministral 3 LLMs are comparable with older open LLMs of similar sizes.

      6 replies →

  • > I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release,

    Why would they? They know they can't compete against the heavily closed-source models.

    They are not even comparing against GPT-OSS.

    That is absolutely and shockingly bearish.

Upvoting for Europe's best efforts.

I don't like being this guy, but I think Deepseek 3.2 stole all the thunder yesterday. Notice that these comparisons are to Deepseek 3.1. Deepseek 3.2 is a big step up over 3.1, if benchmarks are to be believed. Just unfortunate timing of release. https://api-docs.deepseek.com/news/news251201

  • Idk. They look like they're ahead on the saturated benchmarks and behind on the unsaturated ones. Looks more like they overfit to the benchmarks.

I still don't understand what the incentive is for releasing genuinely good model weights. What makes sense however is OpenAI releasing a somewhat generic model like gpt-oss that games the benchmarks just for PR. Or some Chinese companies doing the same to cut the ground from under the feet of American big tech. Are we really hopeful we'll still get decent open weights models in the future?

  • Because there is no money in making them closed.

    Open weight means secondary sales channels like their fine tuning service for enterprises [0].

    They can't compete with large proprietary providers but they can erode and potentially collapse them.

    Open weights and open research build on themselves, advancing their participants and creating an environment that has a shot at competing with proprietary services.

    Transparency, control, privacy, cost etc. do matter to people and corporations.

    [0] https://mistral.ai/solutions/custom-model-training

  • Until there is a sustainable, profitable and moat-building business model for generative AI, the competition is not to have the best proprietary model, but rather to raise the most VC money to be well positioned when that business model does arise.

    Releasing a near state-of-the-art open model instantly catapults companies to valuations of several billion dollars, making it possible to raise money to acquire GPUs and train more SOTA models.

    Now, what happens if such a business model does not emerge? I hope we won't find out!

  • > gpt-oss that games the benchmarks just for PR.

    gpt-oss is killing the ongoing AIMO 3 competition on Kaggle. They're using a hidden, new set of problems, IMO level, handcrafted to be "AI hardened". And gpt-oss submissions are at ~33/50 right now, two weeks into the competition. The benchmarks (at least for math) were not gamed at all. They are really good at math.

  • Google games benchmarks more than anyone, hence Gemini's strong bench lead. In reality though, it's still garbage for general usage.

If the claims on multilingual and pretraining performance are accurate, this is huge! This may be the best-in-class multilingual release since the more recent Gemmas, which used to be unmatched there. I know Americans don't care much about the rest of the world, but we're still using our native tongues, thank you very much; there is a huge issue with e.g. Ukrainian (as opposed to Russian) being underrepresented in many open-weight and weight-available models. Gemma used to be a notable exception; I wonder if that's still the case.

On a different note: I wonder why the 14B model's TriviaQA score lags so far behind Gemma 12B; that one is not a formatting-heavy benchmark.

  • > I wonder why the 14B model's TriviaQA score lags so far behind Gemma 12B; that one is not a formatting-heavy benchmark.

    My guess is the vast scale of Google's data. They've been hoovering data for decades now, and have had curation pipelines (guided by real human interactions) since forever.

Anyone else find that despite Gemini performing best on benches, it's actually still far worse than ChatGPT and Claude? It seems to hallucinate nonsense far more frequently than any of the others. Feels like Google just bench maxes all day every day. As for Mistral, hopefully OSS can eat all of their lunch soon enough.

  • No, I've been using Gemini for help while learning / building my onprem k8s cluster and it has been almost spotless.

    Granted, this is a subject that is very well represented in the training data, but still.

    • I found gemini 3 to be pretty lackluster for setting up an onprem k8s cluster - sonnet 4.5 was more accurate from the get go, required less handholding

  • Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose. Their value is as a structural check on the power of proprietary systems; they guarantee a competitive floor. They’re essential to the ecosystem, but they’re not chasing SOTA.

    • This may be the case, but DeepSeek 3.2 is "good enough" that it competes well with Sonnet 4 -- maybe 4.5 -- for about 80% of my use cases, at a fraction of the cost.

      I feel we're only a year or two away from hitting a plateau with the frontier closed models having diminishing returns vs what's "open"

      1 reply →

    • > Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose.

      Do things ever work that way? What if Google did open-source Gemini? Would you say the same? You never know. There's no "supposed to" and "purpose" like that.

      2 replies →

  • Yep, Gemini is my least favorite and I’m convinced that the hype around it isn’t organic because I don’t see the claimed “superiority”, quite the opposite.

    • I think a lot of the hype around Gemini comes down to people who aren't using it for coding but for other things maybe.

      Frankly, I don't actually care about or want "general intelligence" -- I want it to make good code, follow instructions, and find bugs. Gemini wasn't bad at the last bit, but wasn't great at the others.

      They're all trying to make general purpose AI, but I just want really smart augmentation / tools.

      1 reply →

  • No? My recent experience with Gemini was terrific. The last big test I gave of Claude it spun an immaculate web of lies before I forced it to confess.

  • What does your comment have to do with the submission? What a weird non-sequitur. I even went looking at the linked article to see if it somehow compares with Gemini. It doesn't, and only relates to open models.

    In prior posts you oddly attack "Palantir-partnered Anthropic" as well.

    Are things that grim at OpenAI that this sort of FUD is necessary? I mean, I know they're doing the whole code red thing, but I guarantee that posting nonsense like this on HN isn't the way.

  • I also had bad luck when I finally tried Gemini 3 in the gemini CLI coding tool. I am unclear if it's the model or their bad tooling/prompting. It had, as you said, hallucination problems, and it also had memory issues where it seemed to drop context between prompts here and there.

    It's also slower than both Opus 4.5 and Sonnet.

  • My experience is the opposite although I don't use it to write code but to explore/learn about algorithms and various programming ideas. It's amazing. I am close to cancelling my ChatGPT subscription (I would only use Open Router if it had nicer GUI and dark mode anyway).

  • If anything it's a testament to human intelligence that benchmarks haven't really been a good measure of a model's competence for some time now. They provide a relative sorting to some degree, within model families, but it feels like we've hit an AI winter.

  • Yes, and likewise with Kimi K2. Despite being on the top of open source benches it makes up more batshit nonsense than even Llama 3.

    Trust no one, test your use case yourself is pretty much the only approach, because people either don't run benchmarks correctly or have the incentive not to.

Geometric mean of MMMLU + GPQA-Diamond + SimpleQA + LiveCodeBench:

- Gemini 3.0 Pro : 84.8

- DeepSeek 3.2 : 83.6

- GPT-5.1 : 69.2

- Claude Opus 4.5 : 67.4

- Kimi-K2 (1.2T) : 42.0

- Mistral Large 3 (675B) : 41.9

- Deepseek-3.1 (670B) : 39.7

The 14B, 8B & 3B models are SOTA though, and do not have Chinese censorship like Qwen3.
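
The composite is just the geometric mean of the four per-benchmark scores. A tiny sketch of how that aggregation works; the per-benchmark numbers below are made-up placeholders, not the actual results for any model:

```
# Geometric mean of four benchmark scores (placeholder values, for illustration only).
from math import prod

def geometric_mean(scores: list[float]) -> float:
    return prod(scores) ** (1 / len(scores))

placeholder_scores = [88.0, 85.0, 70.0, 92.0]  # MMMLU, GPQA-Diamond, SimpleQA, LiveCodeBench
print(round(geometric_mean(placeholder_scores), 1))  # -> 83.3
```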

Since no one has mentioned it yet: note that the benchmarks for large are for the base model, not for the instruct model available in the API.

Most likely reason is that the instruct model underperforms compared to the open competition (even among non-reasoners like Kimi K2).

Well done to France's Mistral team for closing the gap. If the benchmarks are to be believed, this is a viable model, especially at the edge.

Congrats on the release, Mistral team!

I haven't used Mistral much until today but am impressed. I normally use Gemma 3 27B locally, but after regenerating some responses with Mistral 3 14B, the output quality is very similar despite generating much faster on my hardware.

The vision aspect also worked fine, and actually was slightly better on the same inputs versus qwen3 VL 8B.

All in all impressive small dense model, looking forward to using it more.

It's sad that they only compare to open weight models. I feel most users don't care much about OSS/not OSS. The value proposition is the quality of the generation for some use case.

I guess it says a bit about the state of European AI

  • It's not for users but for businesses. There is demand for in-house use with data privacy. Regular users can't even run the large model due to lack of compute.

  • Glad I'm not most users. I'm down for 80% of the quality for an open weight model. Hell I've been using Linux for 25 years so I suppose I'm used to not-the-greatest-but-free.

  • It seems to be a reasonable comparison, since that is the primary/differentiating characteristic of the model. It's also really common to see comparisons of only closed-weight/proprietary models, as if all of the non-American and open-weight models don't even exist.

    I also think most people do not consider open weights as OSS.

I use a small model as a chatbot of sorts in a game I'm making. I was hoping the 3b could replace qwen 4b, but it's far worse at following instructions and providing entertaining content. I suppose this is expected given smaller size and their own benchmarks that show Qwen beating it at instruct.

Looking forward to trying them out. Great to see they are Apache 2.0...always good to have easy-to-understand licensing.

I wish they showed how they compared to models larger/better and what the gap is, rather than only models they're better than.

Like how does 14B compare to Qwen30B-A3B?

(Which I think is a lot of people's goto or it's instruct/coding variant, from what I've seen in local model circles)

The small dense models seem particularly good for their sizes; I can't wait to test them out.

Do all of these models, regardless of parameters, support tool use and structured output?

  • In principle any model can do these. Tool use is just detecting something like "I should run a db query for pattern X", and structured output is even easier: just reject output tokens that don't match the grammar (toy sketch below). The only question is how well they're trained, and how well your inference environment takes advantage.
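
    A toy sketch of that "reject tokens that don't match the grammar" idea; the vocabulary, grammar, and uniform logits are all made up, and real inference stacks do this at the logit level of an actual model:

    ```
    # Grammar-constrained sampling over a toy vocabulary: mask out every token the
    # grammar forbids, then sample from whatever survives.
    import math
    import random

    VOCAB = ['{', '"key"', ':', '"value"', '}', 'hello', 'world']

    def allowed_next_tokens(prefix: list[str]) -> set[str]:
        """Toy grammar: exactly one flat {"key": "value"} object, one symbol per step."""
        states = {
            0: {'{'},
            1: {'"key"'},
            2: {':'},
            3: {'"value"'},
            4: {'}'},
        }
        return states.get(len(prefix), set())

    def constrained_sample(logits: dict[str, float], prefix: list[str]) -> str:
        allowed = allowed_next_tokens(prefix)
        masked = {t: l for t, l in logits.items() if t in allowed}
        z = sum(math.exp(l) for l in masked.values())  # softmax over surviving tokens
        r, acc = random.random(), 0.0
        for tok, logit in masked.items():
            acc += math.exp(logit) / z
            if r <= acc:
                return tok
        return next(iter(masked))

    prefix: list[str] = []
    while len(prefix) < 5:
        fake_logits = {t: 0.0 for t in VOCAB}  # stand-in for a real model forward pass
        prefix.append(constrained_sample(fake_logits, prefix))
    print(' '.join(prefix))  # always a well-formed {"key": "value"} object
    ```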

Ugh, the bar charts do not start at 0. It makes it impossible to compare across model sizes. That's a pretty basic chart design principle. I hope they can fix it. At least give me consistent y-axis scales!

I haven't tried a Mistral model in ages. Llama and Mistral feel like something I was using in another era. Are they good?

I find that there are too many paid subscription models at the minute, without enough legitimate progress to warrant the money spent. Recently cancelled GPT.

Anyone succeed in running it with vLLM?

  • The instruct models are available on Ollama (e.g. `ollama run ministral-3:8b`); however, the reasoning models are still a WIP. I was trying to get them to work last night, and single-turn works, but multi-turn is still very flaky.

  • Yes, the 3B variant, with vLLM 0.11.2. Parameters are given on the HF page. Had to override the temperature to 0.15 though (as suggested on HF) to avoid random looking syllables.
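
    For reference, a minimal offline-inference sketch with vLLM along those lines; the HF repo id is an assumption (check the actual model card), and the 0.15 temperature follows the parent comment:

    ```
    # Minimal vLLM offline inference with the temperature override mentioned above.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Ministral-3B-Instruct-2512")  # repo id is an assumption
    params = SamplingParams(temperature=0.15, max_tokens=256)

    outputs = llm.generate(["Summarize the Mistral 3 release in one sentence."], params)
    print(outputs[0].outputs[0].text)
    ```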

I was subscribing to these guys purely to support the EU tech scene. So I was on Pro for about 2 years while using ChatGPT and Claude.

Went to actually use it, got a message saying that I missed a payment 8 months previously and thus wasn't allowed to use Pro, despite having paid for Pro for the previous 8 months. The lady I contacted in support simply told me to pay the outstanding balance. You would think that if you missed a payment it would relate only to the month that was missed, not all subsequent months.

Utterly ridiculous that one missed payment can justify not providing the service (otherwise paid for in full) at all.

Basically, if you find yourself in this situation you're actually better off deleting the account and signing up again under a different email.

We really need to get our shit together in the EU on this sort of stuff, I was a paying customer purely out of sympathy but that sympathy dried up pretty quick with hostile customer service.

  • I'm not sure I understand you correctly, but it seems you had a subscription, missed one payment some time ago, but now expect your subscription to work because the missed month was in the past and "you paid for this month"?

    This sounds like you expect your subscription to work as an on-demand service? It seems quite obvious that to be able to use a service you need to be up to date on your payments; that would be no different in any other subscription/lease/rental agreement. Now, Mistral might certainly look back at their records, see that you actually didn't use their service at all for the last few months, and waive the missed payment. That could be good customer service, but they might not even have a record that you didn't use it, or at least those records might not be available to the billing department.

    • >This sounds like you expect your subscription to work as an on-demand service?

      That's exactly what it is.

      >I'm not sure I understand you correctly,

      I understand perfectly well; the issue is that I don't agree with that approach.

      If I paid for 11/12 months I should get 11/12 months of subscription, not 1/12. They happily just took a year's subscription and provided nothing in return. Even if I paid the outstanding balance, they would have provided 2/12 months of service at the cost of 12/12 months of payment.

  • This seems like a legitimate complaint... I wonder why it's downvoted

    • My critique is more levelled at Mistral and not specifically what they've just released so it could be that some see what I have to say as off topic.

      Also, a lot of Europeans are upset at US tech dominance. It's a position we've roped ourselves into, so any commentary that criticises an EU tech success story is seen as being unnecessarily negative.

      However I do mean it as a warning to others, I got burned even with good intentions.

I am not sure why Meta paid 13B+ to hire some kid vs just hiring back or acquiring these folks. They'll easily catch up.

  • Age aside, I'm not sure what Zuck was thinking, seeing as Scale AI was in data labelling and not in training models; perhaps he thought he was a good operator? Then again, the talent scarcity is in scientists; there are many operators, let alone one worth 14B. Back to age: the people he is managing are likely all several years older than him and Meta long-timers, which would make it even more challenging.

  • What is this referring to? I googled it and the company was founded in 2016. No one involved could be called a "kid"?

    • True, no one involved in Scale AI right now is a kid. But their expertise is in data labelling, not cutting-edge AI. Compare that to the Mistral team. They launched a new LLM within 6 months of founding. They're also ex-Meta researchers. But they don't have the distribution because they're in Europe. If we want to tout 13B acquisitions and 100m pay packages, Mistral is the perfect candidate. It's basically plug and play. Compare that to Scale and the shitshow that ensued. MSL lost talent and has to start from scratch given that its head knows nothing about LLMs.