Comment by hodgehog11
17 hours ago
I got to speak with some of the leads at Ollama and asked more or less this same question. The reason they abandoned llama.cpp is because it does not align with their goals.
llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time (sometimes faster, sometimes slower) and things break really often. You can't hope to establish contracts with simultaneous releases if there is no guarantee the model will even function.
By reimplementing this layer, Ollama gets to enjoy a kind of LTS status that their partners rely on. It won't be as feature-rich, and definitely won't be as fast, but that's not their goal.
Georgi responded to some of the issues Ollama has in the linked thread [1]:
> Looking at ollama's modifications in ggml, they have too much branching in their MXFP4 kernels and the attention sinks implementation is really inefficient. Along with other inefficiencies, I expect the performance is going to be quite bad in ollama.
Ollama responded to that:
> Ollama has worked to correctly implement MXFP4, and for launch we've worked to validate correctness against the reference implementations against OpenAI's own.
> Will share more later, but here is some testing from the public (@ivanfioravanti) not done by us - and not paid or
leading to another response:
> I am sure you worked hard and did your best.
> But, this ollama TG graph makes no sense - speed cannot increase at larger context. Do you by any chance limit the context to 8k tokens?
> Why is 16k total processing time less than 8k?
Whether or not Ollama's claim is right, I find this "we used your thing, but we know better, we'll share details later" behaviour a bit weird.
[1] https://x.com/ggerganov/status/1953088008816619637
ollama has always had a weird attitude towards upstream, and then they wonder why many in the community don't like them
> they wonder why many in the community don't like them
Do they? They probably care more about their "partners".
As GP said:
That's a dumb answer from them.
What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel? Like every Linux distro that's ever run into this issue?
Red Hat doesn't ship the latest build of the Linux kernel to production. And Red Hat didn't reinvent the Linux kernel for shits and giggles.
The Linux kernel does not break userspace.
> What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel?
Yeah, they tried this, this was the old setup as I understand it. But every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (wish I could remember what it was) was so bad they felt they had no choice but to reimplement. It's the lowest risk strategy.
> Yeah, they tried this, this was the old setup as I understand it. But every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them.
Shouldn't any such regressions be regarded as bugs in llama.cpp and fixed there? Surely the Ollama folks can test and benchmark the main models that people care about before shipping the update in a stable release. That would be a lot easier than trying to reimplement major parts of llama.cpp from scratch.
> every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (wish I could remember what it was) was so bad they felt they had no choice but to reimplement. It's the lowest risk strategy.
A much lower risk strategy would be using multiple versions of llama-server to keep supporting old models that would break on newer llama.cpp versions.
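Roughly what I have in mind, as a sketch in Python (the build tags, paths, and ports are all made up; a real setup would also need packaging and a router in front):

```python
import subprocess

# Hypothetical layout: keep several pinned llama-server builds side by side
# and route each model to the build it is known to work with.
PINNED_SERVERS = {
    "old-model.gguf": "/opt/llama.cpp/b3600/llama-server",  # known-good older build
    "new-model.gguf": "/opt/llama.cpp/b6100/llama-server",  # newer build for a new arch
}

def launch(model_path: str, port: int) -> subprocess.Popen:
    """Start the llama-server build pinned for this model."""
    server = PINNED_SERVERS[model_path.rsplit("/", 1)[-1]]
    return subprocess.Popen([server, "-m", model_path, "--port", str(port)])

if __name__ == "__main__":
    proc = launch("/models/old-model.gguf", 8081)
    proc.wait()
```

You pay some disk space for the extra builds, but every partner-critical model keeps running against the exact build it was validated on.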
As someone who has participated in llama.cpp development: it's simple, Ollama doesn't want to give credit to llama.cpp. If llama.cpp went closed, Ollama would fall behind; they blatantly rip off llama.cpp. Who cares, though? All they'd have to say is "powered by llama.cpp". It wouldn't drive most users away from Ollama: most folks will prefer Ollama and power users will prefer llama.cpp. But their ego won't let them.
On llama.cpp breaking things: that's the pace of innovation. It feels like a new model with a new architecture is released every week, and, guess what, it's the same thing we saw with drivers for Unix systems back in the day: no documentation. So implementations are based on whatever can be figured out from the arXiv paper and from other implementations like transformers/vLLM (Python -> C), and quite often the models released by labs are themselves "broken". Jinja templates aren't easy! A bad template will break generation, tool calling, agentic flows, etc. Folks will sometimes blame llama.cpp for that. Other times the implementation is correct, but since its main format is GGUF and anyone can generate a GGUF, an experimental GGUF often gets generated and released by someone excited to be the first to try a new model. Then llama.cpp gets the blame.
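To give a flavour of the template problem: even a quick strict render of a chat template catches a lot of the broken ones before anyone blames the runtime. A minimal sketch, assuming Python with jinja2; the template string here is a made-up toy, not any real model's:

```python
import jinja2

# Made-up minimal chat template, in the style of the ones shipped in GGUF
# metadata (tokenizer.chat_template); real ones are far longer and easier to break.
TEMPLATE = (
    "{% for m in messages %}"
    "<|{{ m['role'] }}|>{{ m['content'] }}<|end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)

def check_template(template_str: str) -> str:
    """Render the template against a tiny conversation; raises on broken syntax or missing vars."""
    env = jinja2.Environment(undefined=jinja2.StrictUndefined)
    tmpl = env.from_string(template_str)
    return tmpl.render(
        messages=[{"role": "user", "content": "hello"}],
        add_generation_prompt=True,
    )

if __name__ == "__main__":
    print(check_template(TEMPLATE))
```

Real templates pulled from GGUF metadata are far longer, and that's exactly where the breakage hides.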
> llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time
Ironic that (according to the article) Ollama rushed to implement GPT-OSS support and thus broke the rest of the GGUF quants (if I understand correctly).
This looks OK on paper, but it doesn't hold up in reality. Ollama is full of bugs, problems, and issues that llama.cpp solved ages ago; this thread is a good example of that.
This is a convenient hand-wavy answer from them, but the truth is they've always been allergic to ever mentioning llama.cpp, even when legally required to. They made a political decision instead of an engineering one, and now they justify it to themselves and to you by hand-waving about llama.cpp somehow being less stable than its core, which they still depend on.
A lot had to happen to get to the point where they're being called out aggressively in public, on their own repo, by nice people, and I hope people don't misread a weak excuse made in conversation, based on innuendo, as solid rationale. llama.cpp has been just fine for me, running in CI on every platform you can think of, for 2 years.
EDIT: I can't reply, but see anoncareer2012's reply.
It's clear you have a better handle on the situation than I do, so it's a shame you weren't the one to talk to them face-to-face.
> llama.cpp has been just fine for me.
Of course, so you really shouldn't use Ollama then.
Ollama isn't a hobby project anymore, they were the only ones at the table with OpenAI many months before the release of GPT-OSS. I honestly don't think they care one bit about the community drama at this point. We don't have to like it, but I guess now they get to shape the narrative. That's their stance, and likely the stance of their industry partners too. I'm just the messenger.
> ...they were the only ones at the table with OpenAI many months before the release of GPT-OSS
In the spirit of TFA:
This isn't true, at all. I don't know where the idea comes from.
You've been repeating this claim frequently. You were corrected on it 2 hours ago: llama.cpp had early access to it as well.
It's bizarre for several reasons:
1. It is a fantasy that engineering involves seats at tables and bands of brothers growing from a hobby into a ???. I find it appealing and romantic, but it is a fantasy nonetheless. Additionally, no one mentioned or implied anything about it being a hobby or unserious.
2. Even if it wasn't a fantasy, it's definitely not what happened here. That's what TFA is about, ffs.
No heroics: they got the ultimate embarrassing outcome for a project piggybacking on FOSS. Ollama can't work with the materials OpenAI put out to help Ollama users, because llama.cpp and Ollama had separate day-1 code landings, and Ollama has zero path to forking literally the entire community over to their format. They were working so loosely with OpenAI that OpenAI assumed they were being sane and weren't trying to use the launch as an excuse to force a community fork of GGUF, and no one realized until after it shipped.
3. I've seen multiple comments from you this afternoon spinning out odd narratives about Ollama and llama.cpp that don't make sense on their face from the perspective of someone who also depends on llama.cpp. AFAICT you understood the GGML fork as some halcyon moment of freedom / not-hobbiness for a project you root for. That's fine. Unfortunately, reality is intruding, hence TFA. Given that you're aware of it, your humbleness re: knowing what's going on here sounds very fake, especially when it precedes another rush of false claims.
4. I think at some point you owe it to even yourself, if not the community, to take a step back and slow down on the misleading claims. I'm seeing more of a gish-gallop than an attempt to recalibrate your technical understanding.
It's been almost 2 hours since you claimed you were sure there were multiple huge breakages due to bad code quality in llama.cpp, and here we see you reframe that claim as a much weaker one that someone else vaguely made to you.
Maybe a good first step to avoiding information pollution here would be to take the time you spend repeating other people's technical claims you didn't understand and instead use it to find some of those breakages you know for sure happened, as promised previously.
In general, I sense a passionate but youthful spirit, not an astro-turfer, and this isn't a group of professionals being disrespected because people still think they're a hobby project. Again, that's what the article is about.
Feels like BS. I'd guess wrapping 2 or even more versions should not be that much of a problem.
There was drama about Ollama not crediting llama.cpp, and most likely crediting it was "not aligning with their goals".
> it does not align with their goals
Ollama is a scam trying to E-E-E the rising hype wave of local LLMs while the getting is still good.
Sorry, but somebody has to point out the elephant in the room here.
It'd be easy enough for ollama alternatives -- they just need to make a CLI front end that lets you run a model with reasonable efficiency without passing any flags. That's really ollama's value, as far as I can tell.
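Something in the spirit of this sketch is all I mean (Python; the model directory, defaults, and flag values are placeholders I'm guessing at, not a real tool):

```python
import glob
import subprocess

MODELS_DIR = "/models"  # hypothetical default location for GGUF files

def run(model_name: str) -> None:
    """'No flags needed': resolve the model file and start llama-server with sane defaults."""
    matches = glob.glob(f"{MODELS_DIR}/*{model_name}*.gguf")
    if not matches:
        raise SystemExit(f"no GGUF matching '{model_name}' under {MODELS_DIR}")
    subprocess.run([
        "llama-server",       # assumes a llama.cpp build on PATH
        "-m", matches[0],
        "-c", "8192",         # a sensible default context size
        "-ngl", "999",        # offload as many layers as fit on the GPU
        "--port", "8080",
    ], check=True)

if __name__ == "__main__":
    run("qwen")
```

That's basically the whole pitch: resolve the file, pick sane defaults, start llama-server.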
Ollama itself doesn't pass that test. (Broken context settings, non-standard formats and crazy model names.)
Thank you. This is genuinely a valid reason even from a simple consistency perspective.
(edit: I think, after reading some of the links, I understand why Ollama comes across as less of a hero. Still, I am giving them some benefit of the doubt since they made local models very accessible to plebs like me; and maybe I can graduate away from Ollama.)
I think this is the thing: if you can use llama.cpp, you probably shouldn't use Ollama. It's designed for the beginner.
You shouldn't use Ollama as a beginner either. It comes with crazy beginner-hostile defaults out of the box.