Comment by timpera
3 months ago
Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.
They mentioned LMArena, you can get the results for that here: https://lmarena.ai/leaderboard/text
Mistral Large 3 is ranked 28th, behind all the other major SOTA models. The gap between Mistral and the leader is only 73 Elo points (1418 vs. 1491), though. I *think* that means the difference is relatively small.
1491 vs. 1418 Elo means the stronger model wins about 60% of the time.
Probably naive questions:
Does that also mean that Gemini-3 (the top-ranked model) loses to Mistral Large 3 40% of the time?
Does that make Gemini 1.5x better, or Mistral 2/3rds as good as Gemini, or can we not quantify the difference like that?
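Under the standard Elo model, roughly yes to the first question (ignoring ties and assuming the arena scores behave like chess Elo): expected score follows a logistic curve, so a 73-point gap gives the stronger model about a 60% expected score and the weaker about 40%. As for the second question, Elo differences are log-odds, not ratios, so "1.5x better" isn't a meaningful reading. A minimal sketch of the arithmetic, using the leaderboard numbers quoted above:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected score for player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Leader (1491) vs. Mistral Large 3 (1418): a 73-point gap
p = elo_win_prob(1491, 1418)
print(f"{p:.3f}")  # ≈ 0.60, i.e. the leader is expected to win about 60% of the time
```

Note the expected scores of the two sides always sum to 1, so the weaker model's expected score here is about 0.40, and equal ratings give exactly 0.5.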
I guess that could be considered comparative advertising then, and companies generally try to avoid that kind of scrutiny.
The lack of the comparison (which absolutely was done) tells you exactly what you need to know.
I think people from the US often aren't aware of how many companies in the EU simply won't risk losing their data to the providers you have in mind: OpenAI, Anthropic, and Google. They are simply not an option at all.
The company I work for, for example, a mid-sized tech business, is currently investigating its local hosting options for LLMs. So Mistral will certainly be an option, alongside the Qwen family and DeepSeek.
Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.
We're seeing the same thing for many companies, even in the US. Exposing your entire codebase to an unreliable third party is not exactly SOC / ISO compliant. This is one of the core things that motivated us to develop cortex.build so we could put the model on the developer's machine and completely isolate the code without complicated model deployments and maintenance.
Does your company use Microsoft Teams?
Mistral was founded by multiple ex-Meta engineers, no?
Funded mostly by US VCs?
Hosted primarily on Azure?
Do you really have to go out of your way to start calling their competition "data leeches" for out-executing them?
They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.
There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.
A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.
Scale AI wrote a paper a year ago comparing various models' performance on benchmarks to their performance on similar but held-out questions. Generally the closed-source models performed better, and Mistral came out looking pretty bad: https://arxiv.org/pdf/2405.00332
??? Closed US frontier models are vastly more effective than anything OSS right now; the reason they didn't compare is that they're in a different weight class (and therefore a different product), and it's a bit unfair.
We're actually at a unique point right now where the gap is larger than it has been in some time. The consensus since the latest batch of releases is that we haven't found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models, and unless you have unique requirements somewhere down the price/perf curve, I would not even look at this release (which is fine!)
If someone is using these models, they probably can't or won't use the existing SOTA models, so not sure how useful those comparisons actually are. "Here is a benchmark that makes us look bad from a model you can't use on a task you won't be undertaking" isn't actually helpful (and definitely not in a press release).
I completely agree that there are legitimate reasons to prefer comparisons to e.g. DeepSeek models. But that doesn't change my point: we both agree that the comparisons would be extremely unfavorable.
Here's what I understood from the blog post:
- Mistral Large 3 is comparable with the previous Deepseek release.
- Ministral 3 LLMs are comparable with older open LLMs of similar sizes.
And implicit in this is that it compares very poorly to SOTA models. Do you disagree with that? Do you think these models are beating SOTA and they didn't include the benchmarks because they forgot?
> I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release,
Why would they? They know they can't compete against the heavily closed-source models.
They are not even comparing against GPT-OSS.
That is absolutely and shockingly bearish.