Comment by trouve_search

1 day ago

OK, I'm 100% rooting for both Mistral and task focused small models.

But Mistral has fallen really far behind since 2025Q3. It seems they can't get good reasoning models working at even medium context sizes, which is necessary to be at the table right now.

Gemma4 and Qwen3.6 are currently best in the small size; Mistral's "small" model has ~4x the parameter count at 120B and isn't even competing with models a quarter its size.

Back one year ago with Mistral Small 3.1 they were keeping up, but they've fallen into irrelevancy right now.

If Mistral seriously wants to play the on-prem and small task-specific model game, a decent proxy would be to build models that get the r/localLlama crowd excited.

I think it really depends on what you’re doing. I use Mistral for many tasks in https://phrasing.app and their models blow models many times their size out of the water.

None of my tasks use reasoning though (reasoning actually kills the performance), so perhaps that’s why. Still, I just had to rewrite my pipeline, and Mistral was faster, cheaper, and substantially better than any alternative.

I agree. I am a paying Le Chat Pro user, really rooting for a European alternative. But the quality difference between Mistral and the frontier labs is growing too big to ignore. It’s worrying to me that they didn’t talk much about new models at the conference, because that is really where their focus should be IMHO.

I am wondering what is keeping them back, though: Money? Compute? Skills? Training data? My fear is that you are really only getting really good models by training on very dubious data (outputs from the frontier models etc) and that Mistral is too European and too enterprisey to take those risks.

  • My theory with no insider information: it’s a little of all of the above, but mostly money. To some extent, you can dig yourself out of a data hole with RL and a lot of compute. And you can buy a lot of compute and some data with a lot of money. Big labs have been operating in this regime for a while and it’s one of the drivers behind their costs beyond just scaling the weights and doing the actual training. Mistral just doesn’t have access to this level of compute or the money to try and muscle their way in.

  • > I am wondering what is keeping them back, though: Money? Compute? Skills? Training data?

    Not ruthless enough and no backing by a corrupt govt administration that has no morals but focuses on self-enrichment instead.

    Might sound drastic but I think that's actually closer to the truth than everybody likes to admit.

    > My fear is that you are really only getting really good models by training on very dubious data (outputs from the frontier models etc) and that Mistral is too European and too enterprisey to take those risks.

    Exactly.

  • Should it, though?

    I think a European company, taking Chinese models, perhaps doing its own post-training on them and training the Chinese-ness out, with a great chat service, enterprise API and coding agent, could be pretty valuable in itself.

  • > I am wondering what is keeping them back, though: Money? Compute? Skills? Training data?

    Considering all their talk about new DCs and compute, and a few offhand comments, it sounded to me that compute is a big limitation.

  • > what is keeping them back, though: Money? Compute? Skills? Training data?

    All of the above and more. Everything holding Mistral back is the same thing that has held Europe back from competing in the entire digital revolution. See this 1991 article lamenting the loss of any viable European PC manufacturer: https://www.nytimes.com/1991/04/22/business/europe-stumbles-...

    Mistral being in Europe is disadvantaged with:

    1. Money: less diverse private pension fund environment = fewer LPs to invest in VC funds = fewer VC dollars to invest in new ventures. European money is vacuumed out of the private sector into state pension funds and dumped into low-yielding government bonds. This starves the private sector of capital while inflating the % of GDP driven by government spending every year (government pension funds buying government bonds in circular fashion enable runaway deficit spending... just like circular AI infrastructure spending).

    2. Talent & compute: due to #1, Silicon Valley can outbid Europe for the best talent and hardware. Watch an OpenAI launch video and listen to all the European accents.

    3. Local market fragmentation: Europe is a collection of countries that pretend to work together while not even having a unified capital market. The average EU citizen can barely communicate with their neighbor in a common language beyond the level of a toddler (English fluency is massively overstated by Americans who only experience tourist capitals).

    4. Regulatory disadvantages: In everything from company regs, employee regs, unions, privacy regs, data portability regs, etc.

    It's not "culture" or Europeans being "lazy" as most people would claim. There are currently thousands of young French people working 80-hour weeks creating dumb consulting PowerPoints or legacy investment banking deal memos as we speak. Ambitious people exist everywhere in equal proportion; they're just working on the wrong things.

    Europe can't compete in the digital revolution the same way they could compete in the industrial revolution due to various system design choices. Culture is simply the aesthetically observed byproducts of system design.

    • > 4. Regulatory disadvantages: In everything from company regs, employee regs, unions, privacy regs, data portability regs, etc.

      Agreed. My own anecdote: my company is global and for the past 6 months, we've been working on getting regulatory and legal approval for an LLM-based feature. The initial proposals of going live in all of our markets have been pared back to exclude Europe altogether due to the regulatory environment.

      When I took part in company-wide gen AI councils that reviewed new product rollouts, it seemed like there was a definite hesitation from higher ups from pushing out any leading edge features to European markets. And it's not that the regulations would necessarily block these features from going live but that they'd increase implementation costs to the point where it wouldn't be worth it.

    • > The average EU citizen can barely communicate with their neighbor in a common language beyond the level of a toddler (English fluency is massively overstated by Americans who only experience tourist capitals).

      Not true in my experience: even German waiters in small towns tend to have pretty fluent English.

    • 1 and 2 are the same. Infinite money with barely any consequence because of 'reserve currency' privilege. To compete with that, the EU can't nuke the dollar because it would be suicide given the Eurodollar realities, and they can't anchor EU IP and talent because our politicians are too intertwined with globalist ideology and capital.

    • > 2. Talent & compute: due to #1, Silicon Valley can outbid Europe for the best talent and hardware. Watch an OpenAI launch video and listen to all the European accents.

      There is definitely a lot of truth to that. Maybe a bit of an arbitrary measure, but these are the nationalities of the people who wrote the "Attention is all you need" paper. Pretty revealing, I find:

      Ashish Vaswani: India

      Niki Parmar: India

      Jakob Uszkoreit: Germany

      Llion Jones: Wales (UK)

      Aidan Gomez: Canada

      Łukasz Kaiser: Poland

      Illia Polosukhin: Ukraine

      Noam Shazeer: USA

    • You say that as if the American version of maximalist Capitalism is good or desirable to most people.

      Personally, I would much rather have good public pensions and health care than AI agents.

    • re: #4 Maybe it’s easier if you grow up in the system and know how to navigate the written and unwritten rules, but as a dual Canadian-American who recently gained Austrian citizenship, the regulatory friction is absolutely real. I decided to launch a new venture through an Austrian GmbH.

      There are supposedly streamlined paths for local residents, but I had to go through the standard corporate pipeline. I spent three months fighting a bizarre catch-22 between my notary (who cost €3k+) and the bank. To open the account, I had to prove I deposited €10k in capital. But I couldn't make the deposit without an active bank account. On top of that, the bank's compliance team kept arbitrarily canceling my application due to "incorrect answers"... refusing to tell me what the errors actually were and forcing me to restart the entire process ab initio.

      I finally just gave up. I wrote off the €3,000 notary fee and €1,000 in registered office costs as a sunk cost, and incorporated a US LLC instead. It took under 10 minutes, no notary, fees of $25 since I did it myself, plus another 20 minutes to open the business bank account.

      There was no commercial reason to choose Austria; it was purely sentimental. My ancestors were entrepreneurs in Linz and Vienna, and I loved the idea of renewing that legacy. But the sheer weight of the bureaucracy managed to kill about 99% of the early-stage startup enthusiasm you normally rely on to get a new project off the ground.

> task focused small models

This is tangential, and forgive my ignorance here, but is there an inherent reason why there aren't smaller, focused models from the frontier model providers?

I'm thinking something like a software-specific subset of Opus that is the default for use in Claude Code. Smaller, cheaper to deploy and consume, maybe faster.

  • OpenAI used to make Codex-specific models, but they stopped. What I've gathered from interviews and similar is that training two models isn't worth the (small) lift from having a coding-specific model. You're pre-training on everything anyway, and coding RL is reasonably useful for general-purpose models too.

Agreed, the next price increase from frontier labs (and the inevitable limits decrease in subscription tiers) will have people thinking real hard about their model providers, and that's when Mistral should be ready. However, given their recent performance, I realistically don't have my hopes up.

  • Also, new Medium 3.5 is far more expensive than previous Mistral models, and much more expensive than e.g. Deepseek

    • I tried it out on some dev tasks with their Mistral Vibe subscription, and the performance was pretty okay (okay, not great), in regard to both development quality and speed. Worse than the Anthropic models I'm used to, but at 20 EUR per month it wasn't a bad deal, except that the 200k context size would more or less be a deal breaker in many cases.

    • Everything is more expensive than deepseek. They aren't frontier in intelligence but they are the frontier in cost per intelligence

> they've fallen into irrelevancy right now

It's a very charitable take, as Mistral has never really left the realm of irrelevancy.

It's only a matter of time before the EU falls back to hosting Chinese models in EU datacenters.

Yeah. I run LLM models locally and for me 22B-32B is the largest I'm willing to invest in trying out.

Even though Mistral 4 has 6B active parameters per token (allowing 3-3.5 per token parameters to be loaded on a 4090), the ~240GB download + storage is pushing the limits of being able to try this out locally, especially if you are downloading and evaluating multiple models.

It also makes it harder for other people to make downstream finetunes like with what happened with the older Mistral/Magistral models.
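A rough back-of-the-envelope check of the numbers in this comment, as a minimal sketch. It assumes the 120B total parameter count quoted earlier in the thread and 16-bit stored weights (the thread does not state the precision); the function name `weights_gb` is made up for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    # (params_billion * 1e9 params) * (bits / 8 bytes) / (1e9 bytes per GB)
    return params_billion * bits_per_weight / 8


# Assumption: 120B total parameters stored at 16 bits each.
full_download = weights_gb(120, 16)

# Idealized 4-bit quantization; real formats add some metadata overhead.
four_bit = weights_gb(120, 4)

# Only ~6B parameters are active per token, but a sparse model still
# needs every expert stored somewhere, so the download does not shrink.
active_per_token = weights_gb(6, 16)

print(f"120B at 16-bit: {full_download:.0f} GB")
print(f"120B at 4-bit:  {four_bit:.0f} GB")
print(f"6B active at 16-bit: {active_per_token:.0f} GB")
```

Under those assumptions the 16-bit figure lines up with the ~240GB download mentioned above, which is why a small active parameter count does little for people who want to pull and compare several models.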

  • I think machines like the DGX Spark are about to become a lot more common/popular. It’s big enough to run sparse 150-250B MoEs with enough throughput for a single user. Deepseek v4 Flash is #1 (in terms of usage) on OpenRouter because it’s good enough to be useful. You can run it on a Spark (though it runs better across 2, which is getting up there in cost).

I find Mistral Medium 3.5 with OpenCode is perfectly fine if you're willing to talk to it in a more fine-grained way about actual code. For me that's fine because even with huge frontier models I don't like trying to vibe prompt like a product manager.

I don't agree that they are falling behind. Using both the chat and the CLI, I get what I need, and it's comparable to "SOTA" when I compare.

Mistral is entering the "let's extract as much money from EU taxpayers as we can" phase of a European tech company that did not get bought by a US one.

They'll end like Dailymotion, just a zombie company.

Nobody trying to compete with Google, OpenAI, and Anthropic should be playing the small models / local models game.

Foundation model labs should be building very large reasoning models, then leaving it to the community to distill them down.

You can't scale a small model up, but you can scale a large model down.

I'm convinced the only way we'll have a seat at the table in the future and avoid total runaway takeoff is if there are very large models within 80% of the capabilities of the frontier models. Tiny RTX models do diddly squat to remain competitive.

Build open weights models for running on H200s. I'll spin them up on RunPod or Lambda.

  • I do think there's a chance open weight models have a bit of a moment with the costs of frontier models growing on business balance sheets. It's unfortunate from my "privacy loving" PoV that it's mostly Chinese models filling the gap (the top models on OpenRouter, for instance).

    I have used Mistral models out of pure ideology for web agents and the like which aren't doing a lot of heavy lifting.

    • Antirez’s Deepseek 4 Flash implementation that can run on MacBooks also was a revelation. It runs decently on M5 Max 128GB and it’s pointing out other bottlenecks like prefill speed which will improve.

  • I thought distillation meant small models don't have to compete with the big models and can always eventually achieve close parity, but it's just a matter of time to do the distillation? (i.e. how much lag do you want to live with) Am I oversimplifying?

    • There is likely a theoretical limit to how much intelligence you can pack into a model of a given size (especially when stretching that over a large input context size).

      Our evals are pretty complex so we only recently started testing ~30B class models, which are now becoming quite smart (on par with the frontier from 1 year ago). Mistral is far behind, but I'm rooting for them.

      Data at https://gertlabs.com/rankings

We actually found that Mistral Small 4, quantized to 4-bit, was comparable to Qwen 3.6 27B and is roughly the same size. At least in our experience on our use cases, quantizing the Mistral model worked far better than trying to quantize the Qwen family.

Fully agree with your point though: Mistral in general is far behind where I'd expect, and Qwen in particular is crushing it at the smaller sizes.

Personally, I'd consider anything 20B params and above a "medium" model. Small being <20B and large >100B. I think obviously we can get to the huge 1-2T param models, but frankly the margin of accuracy improvement for the speed hit is kinda insane (1-2% for many metrics).

  • It's all relative. For local use I'd classify it by hardware (VRAM size) using FP8 or Q6 quantization:

    1. tiny <2-3B -- easily runnable on lower-spec hardware

    2. small 4-8B -- runnable on 8GB GPUs

    3. medium 9-12B -- runnable on 12GB GPUs

    4. large 13-24B -- runnable on 16GB (for the lower end models) and 24GB GPUs

    5. very large 25-32B -- runnable on 32GB GPUs

    6. huge >32B -- not easily runnable on consumer GPUs without compromising performance (offloading layers to the CPU/RAM), quality (heavy quantization, esp. at <= Q4), or price (investing in multi-GPU setups and/or server-grade hardware).

    You could possibly split huge down further, as 70GB models (e.g. llama 3) are easier to get working than >120GB models and 1TB models are completely intractable.
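The tiers above can be sketched as a small lookup, purely as an illustration. Assumptions: the last two tiers are read as parameter counts like the first four, FP8 is taken as 8 bits per weight, and Q6 as roughly 6.5 bits per weight (an approximation; actual formats vary). The names `tier` and `footprint_gb` are hypothetical.

```python
from bisect import bisect_left

# Upper bounds in billions of parameters, taken from the list above.
# Assumption: the "very large" and "huge" tiers are parameter counts too.
TIER_BOUNDS = [3, 8, 12, 24, 32]
TIER_NAMES = ["tiny", "small", "medium", "large", "very large", "huge"]

# Assumed bits per weight; the Q6 value is an approximation.
BITS_PER_WEIGHT = {"FP8": 8.0, "Q6": 6.5}


def tier(params_billion: float) -> str:
    """Return the first tier whose upper bound covers the model size."""
    return TIER_NAMES[bisect_left(TIER_BOUNDS, params_billion)]


def footprint_gb(params_billion: float, fmt: str) -> float:
    """Weights-only size in decimal GB; excludes KV cache and activations."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


for size in (7, 12, 24, 32, 70):
    fp8 = footprint_gb(size, "FP8")
    q6 = footprint_gb(size, "Q6")
    print(f"{size:>3}B  {tier(size):<10}  FP8 ~{fp8:.1f} GB  Q6 ~{q6:.1f} GB")
```

Note that at FP8 the top of each tier roughly fills the matching GPU with weights alone, so in practice the Q6 figure (or a smaller context) is what leaves room for the KV cache.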

    • As a Mac user:

      1. tiny <2-3B -- could run in a browser even, mac neo

      2. small 4-8B -- last of browser options, MacBook Air base

      3. medium 9-24B -- 32GB machine, air or pro notebook or mini

      4. large 25-48B -- 64GB, pro notebook or mini

      5. x-large 49-100B -- 128GB MacBook Pro or Studio

      6. Huge > 100B -- 256/512GB Mac Studio

> a decent proxy would be to build models that get the r/localLlama crowd excited

I don’t really disagree with your post, but this is not exactly right. That subreddit seems to go from hype train to hype train every week; I haven’t found anything really insightful in it for quite a while now.