
Comment by tarruda

18 hours ago

I recently discovered that ollama no longer uses llama.cpp as a library; instead they link to the lower-level library (ggml), which requires them to reinvent a lot of wheels for absolutely no benefit (if there's some benefit I'm missing, please let me know).

Even using llama.cpp as a library seems like overkill for most use cases. Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket and forwarding requests to it.
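
Something like this minimal sketch is what I have in mind (the model path is a placeholder, and since I'm not sure every llama-server build can listen on a unix socket, this uses a loopback TCP port instead):

    import json
    import subprocess
    import time
    import urllib.request

    # Spawn llama-server as a child process (placeholder model path).
    server = subprocess.Popen(
        ["llama-server", "-m", "models/some-model.gguf",
         "--host", "127.0.0.1", "--port", "8080"]
    )

    # Poll the /health endpoint until the server is ready.
    for _ in range(120):
        try:
            urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=1)
            break
        except OSError:
            time.sleep(0.5)

    # Forward an OpenAI-style chat request and print the reply.
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps({"messages": [{"role": "user", "content": "Hello"}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])

    server.terminate()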

One thing I'm curious about: Does ollama support strict structured output or strict tool calls adhering to a json schema? Because it would be insane to rely on a server for agentic use unless your server can guarantee the model will only produce valid json. AFAIK this feature is implemented by llama.cpp, which they no longer use.

I got to speak with some of the leads at Ollama and asked more or less this same question. The reason they abandoned llama.cpp is because it does not align with their goals.

llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time (sometimes faster, sometimes slower) and things break really often. You can't hope to establish contracts with simultaneous releases if there is no guarantee the model will even function.

By reimplementing this layer, Ollama gets to enjoy a kind of LTS status that their partners rely on. It won't be as feature-rich, and definitely won't be as fast, but that's not their goal.

  • Georgi gave a response to some of the issues ollama has, in the linked thread [1]:

    > Looking at ollama's modifications in ggml, they have too much branching in their MXFP4 kernels and the attention sinks implementation is really inefficient. Along with other inefficiencies, I expect the performance is going to be quite bad in ollama.

    ollama responded to that

    > Ollama has worked to correctly implement MXFP4, and for launch we've worked to validate correctness against the reference implementations against OpenAI's own.
    > Will share more later, but here is some testing from the public (@ivanfioravanti) not done by us - and not paid or

    leading to another response

    > I am sure you worked hard and did your best.
    > But, this ollama TG graph makes no sense - speed cannot increase at larger context. Do you by any chance limit the context to 8k tokens?
    > Why is 16k total processing time less than 8k?

    Whether or not Ollama's claim is right, I find this "we used your thing, but we know better, we'll share details later" behaviour a bit weird.

    [1] https://x.com/ggerganov/status/1953088008816619637

  • That's a dumb answer from them.

    What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel? Like every linux distro that's ever run into this issue?

    Red Hat doesn't ship the latest build of the linux kernel to production. And Red Hat didn't reinvent the linux kernel for shits and giggles.

    • The Linux kernel does not break userspace.

      > What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel?

      Yeah, they tried this; that was the old setup as I understand it. But every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (I wish I could remember which) was so bad that they felt they had no choice but to reimplement. It's the lowest-risk strategy.


  • As someone who has participated in llama.cpp development: it's simple, Ollama doesn't want to give credit to llama.cpp. If llama.cpp went closed, Ollama would fall behind; they blatantly rip off llama.cpp. Who cares, though? All they have to say is "powered by llama.cpp". It won't drive most users away from Ollama; most folks will prefer Ollama and power users will prefer llama.cpp. But their ego won't let them.

    On llama.cpp breaking things: that's the pace of innovation. It feels like a new model with a new architecture is being released every week. Guess what? It's the same thing we saw with drivers for Unix systems back in the day: no documentation. So implementations are based on whatever can be figured out from the arXiv paper and from other implementations like transformers/vllm (Python -> C), and quite often the models released by labs are "broken". Jinja templates ain't easy! Bad templates will break the model's generation, tool calling, agentic flow, etc. Folks will sometimes blame llama.cpp; sometimes the implementation is correct, but since llama.cpp's main format is gguf and anyone can generate a gguf, quite often an experimental gguf is generated and released by folks excited to be the first to try a new model. Then llama.cpp gets the blame.

  • > llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time

    Ironic that (according to the article) ollama rushed to implement GPT-OSS support and thus broke the rest of the gguf quants (if I understand correctly).

  • This looks OK on paper, but it isn't borne out in reality. Ollama is full of bugs, problems, and issues that llama.cpp solved ages ago. This thread is a good example of that.

  • This is a good hand-wavy answer from them, but the truth is they've always been allergic to ever mentioning llama.cpp, even when legally required to. They made a political decision instead of an engineering one, and they now justify it to themselves and to you by hand-waving about llama.cpp somehow being less stable than its own core, which they still depend on.

    A lot of things happened to get to the point where they're getting called out aggressively in public, on their own repo, by nice people, and I hope people don't misread a weak excuse made in conversation, based on innuendo, as a solid rationale. llama.cpp has been just fine for me, running in CI on every platform you can think of, for 2 years.

    EDIT: I can't reply, but see anoncareer2012's reply.

    • It's clear you have a better handle on the situation than I do, so it's a shame you weren't the one to talk to them face-to-face.

      > llama.cpp has been just fine for me.

      Of course, so you really shouldn't use Ollama then.

      Ollama isn't a hobby project anymore; they were the only ones at the table with OpenAI many months before the release of GPT-OSS. I honestly don't think they care one bit about the community drama at this point. We don't have to like it, but I guess now they get to shape the narrative. That's their stance, and likely the stance of their industry partners too. I'm just the messenger.


  • Feels like BS. I guess wrapping 2 or even more versions should not be that much of a problem.

    There was drama that ollama doesn't credit llama.cpp, and most likely crediting it was "not aligning with their goals".

  • > it does not align with their goals

    Ollama is a scam trying to E-E-E the rising hype wave of local LLMs while the getting is still good.

    Sorry, but somebody has to voice the elephant in the room here.

    • It'd be easy enough for ollama alternatives -- they just need to make a CLI front end that lets you run a model with reasonable efficiency without passing any flags. That's really ollama's value, as far as I can tell.


  • Thank you. This is genuinely a valid reason even from a simple consistency perspective.

    (edit: I think -- after reading some of the links -- I understand why Ollama comes across as less of a hero. Still, I am giving them some benefit of the doubt, since they made local models very accessible to plebs like me; and maybe I can graduate away from ollama.)

It is not true that Ollama doesn't use llama.cpp anymore. They built their own library, which is the default, but it is also really far from being feature-complete. If a model is not supported by their library, they fall back to llama.cpp. For example, there is a group of people trying to get the new IBM models working with Ollama [1]. Their quick, short-term solution is to bump the version of llama.cpp included with Ollama to a newer one that has support, and then at a later time add support in Ollama's own library.

[1] https://github.com/ollama/ollama/issues/10557

> Does ollama support strict structured output or strict tool calls adhering to a json schema?

As far as I understand, this is generally not possible at the model level. The best you can do is wrap the call in a (non-llm) json schema validator and emit an error json in case the llm output does not match the schema, which is what some APIs do for you, but it's not very complicated to do yourself.

Someone correct me if I'm wrong

  • No, that's incorrect - llama.cpp has support for providing a context-free grammar while sampling, and it only samples tokens that conform to the grammar, never tokens that would violate it.
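
    For example, llama-server exposes this through the "grammar" and "json_schema" fields of its /completion endpoint. A rough sketch (assuming a server is already running on localhost:8080; check your llama.cpp version's server README for the exact field names):

        import json
        import urllib.request

        # Constrain sampling to this JSON schema; llama-server can also take
        # a raw GBNF grammar string via the "grammar" field instead.
        body = {
            "prompt": "List one fruit as JSON: ",
            "n_predict": 64,
            "json_schema": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        }
        req = urllib.request.Request(
            "http://127.0.0.1:8080/completion",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            # Output conforms to the schema (barring truncation at n_predict).
            print(json.load(resp)["content"])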

  • This is misinformation. Ollama has supported structured outputs that conform to a given JSON schema for months. Here's a post about this from last year: https://ollama.com/blog/structured-outputs

    This is absolutely possible to do at the model level via logit shaping. Llama-cpp’s functionality for this is called GBNF. It’s tightly integrated into the token sampling infrastructure, and is what ollama builds upon for their json schema functionality.
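
    A minimal sketch against Ollama's /api/chat endpoint, per the blog post above ("llama3.2" is just a placeholder model name):

        import json
        import urllib.request

        # "format" accepts a JSON schema; the model's output is constrained to it.
        body = {
            "model": "llama3.2",
            "messages": [{"role": "user", "content": "Name one fruit."}],
            "format": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
            "stream": False,
        }
        req = urllib.request.Request(
            "http://localhost:11434/api/chat",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(json.load(resp)["message"]["content"])  # schema-conforming JSON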

    • > It’s tightly integrated into the token sampling infrastructure, and is what ollama builds upon for their json schema functionality.

      Do you mean that the functionality of generating an EBNF grammar from a json schema and using it for sampling is part of ggml, and all they have to do is use it?

      I assumed that this was part of llama.cpp, and another feature they have to re-implement and maintain.

> (if there's some benefit I'm missing, please let me know).

Makes their VCs think they're doing more, and have more ownership, rather than being a do-nothing wrapper with some analytics and S3 buckets that rehost models from HF.

> Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket and forwarding requests to it

I'd recommend taking a look at https://github.com/containers/ramalama - it's closer to what you're describing in the way it uses llama-server, and it's also container-native by default, which is nice for portability.