
Comment by tarruda

18 hours ago

I recently discovered that ollama no longer uses llama.cpp as a library; instead they link to the lower-level library (ggml), which requires them to reinvent a lot of wheels for absolutely no benefit (if there's some benefit I'm missing, please let me know).

Even using llama.cpp as a library seems like overkill for most use cases. Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket and forwarding requests to it.
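
Something like this minimal sketch is what I have in mind (the model path is a placeholder, and since I'm not sure every llama-server build can listen on a unix socket, this uses a loopback TCP port instead):

    import json
    import subprocess
    import time
    import urllib.request

    # Spawn llama-server as a child process (placeholder model path).
    server = subprocess.Popen(
        ["llama-server", "-m", "models/some-model.gguf",
         "--host", "127.0.0.1", "--port", "8080"]
    )

    # Poll the /health endpoint until the server is ready.
    for _ in range(120):
        try:
            urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=1)
            break
        except OSError:
            time.sleep(0.5)

    # Forward an OpenAI-style chat request and print the reply.
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps({"messages": [{"role": "user", "content": "Hello"}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])

    server.terminate()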

One thing I'm curious about: Does ollama support strict structured output or strict tool calls adhering to a json schema? Because it would be insane to rely on a server for agentic use unless your server can guarantee the model will only produce valid json. AFAIK this feature is implemented by llama.cpp, which they no longer use.

I got to speak with some of the leads at Ollama and asked more or less this same question. The reason they abandoned llama.cpp is because it does not align with their goals.

llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time (sometimes faster, sometimes slower) and things break really often. You can't hope to establish contracts with simultaneous releases if there is no guarantee the model will even function.

By reimplementing this layer, Ollama gets to enjoy a kind of LTS status that their partners rely on. It won't be as feature-rich, and definitely won't be as fast, but that's not their goal.

  • Georgi gave a response to some of the issues ollama has, in the linked thread [1]:

    > Looking at ollama's modifications in ggml, they have too much branching in their MXFP4 kernels and the attention sinks implementation is really inefficient. Along with other inefficiencies, I expect the performance is going to be quite bad in ollama.

    ollama responded to that

    > Ollama has worked to correctly implement MXFP4, and for launch we've worked to validate correctness against the reference implementations against OpenAI's own.
    > Will share more later, but here is some testing from the public (@ivanfioravanti) not done by us - and not paid or

    leading to another response

    > I am sure you worked hard and did your best.
    > But, this ollama TG graph makes no sense - speed cannot increase at larger context. Do you by any chance limit the context to 8k tokens?
    > Why is 16k total processing time less than 8k?

    Whether or not Ollama's claim is right, I find this "we used your thing, but we know better, we'll share details later" behaviour a bit weird.

    [1] https://x.com/ggerganov/status/1953088008816619637

  • That's a dumb answer from them.

    What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel? Like every linux distro that's ever run into this issue?

    Red Hat doesn't ship the latest build of the linux kernel to production. And Red Hat didn't reinvent the linux kernel for shits and giggles.

    • The Linux kernel does not break userspace.

      > What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel?

      Yeah, they tried this; that was the old setup as I understand it. But every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (I wish I could remember which) was so bad that they felt they had no choice but to reimplement. It's the lowest-risk strategy.


  • As someone who has participated in llama.cpp development: it's simple, Ollama doesn't want to give credit to llama.cpp. If llama.cpp went closed, Ollama would fall behind; they blatantly rip off llama.cpp. Who cares, though? All they have to say is "powered by llama.cpp". It won't drive most users away from Ollama; most folks will prefer Ollama and power users will prefer llama.cpp. But their ego won't let them.

    On llama.cpp breaking things: that's the pace of innovation. It feels like a new model with a new architecture is being released every week. Guess what? It's the same thing we saw with drivers for Unix systems back in the day: no documentation. So implementations are based on whatever can be figured out from the arXiv paper and from other implementations like transformers/vllm (Python -> C), and quite often the models released by labs are "broken". Jinja templates ain't easy! Bad templates will break the model's generation, tool calling, agentic flow, etc. Folks will sometimes blame llama.cpp; sometimes the implementation is correct, but since llama.cpp's main format is gguf and anyone can generate a gguf, quite often an experimental gguf is generated and released by folks excited to be the first to try a new model. Then llama.cpp gets the blame.

  • > llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time

    Ironic that (according to the article) ollama rushed to implement GPT-OSS support and thus broke the rest of the gguf quants (if I understand correctly).

  • This looks OK on paper, but it isn't borne out in reality. Ollama is full of bugs, problems, and issues that llama.cpp solved ages ago. This thread is a good example of that.

  • This is a good hand-wavy answer from them, but the truth is they've always been allergic to ever mentioning llama.cpp, even when legally required to. They made a political decision instead of an engineering one, and they now justify it to themselves and to you by hand-waving about llama.cpp somehow being less stable than its own core, which they still depend on.

    A lot of things happened to get to the point where they're getting called out aggressively in public, on their own repo, by nice people, and I hope people don't misread a weak excuse made in conversation, based on innuendo, as a solid rationale. llama.cpp has been just fine for me, running in CI on every platform you can think of, for 2 years.

    EDIT: I can't reply, but see anoncareer2012's reply.

    • It's clear you have a better handle on the situation than I do, so it's a shame you weren't the one to talk to them face-to-face.

      > llama.cpp has been just fine for me.

      Of course, so you really shouldn't use Ollama then.

      Ollama isn't a hobby project anymore; they were the only ones at the table with OpenAI many months before the release of GPT-OSS. I honestly don't think they care one bit about the community drama at this point. We don't have to like it, but I guess now they get to shape the narrative. That's their stance, and likely the stance of their industry partners too. I'm just the messenger.


  • Feels like BS. I guess wrapping 2 or even more versions should not be that much of a problem.

    There was drama that ollama doesn't credit llama.cpp, and most likely crediting it was "not aligning with their goals".

  • > it does not align with their goals

    Ollama is a scam trying to E-E-E the rising hype wave of local LLMs while the getting is still good.

    Sorry, but somebody has to voice the elephant in the room here.

    • It'd be easy enough for ollama alternatives -- they just need to make a CLI front end that lets you run a model with reasonable efficiency without passing any flags. That's really ollama's value, as far as I can tell.


  • Thank you. This is genuinely a valid reason even from a simple consistency perspective.

    (edit: I think -- after reading some of the links -- I understand why Ollama comes across as less of a hero. Still, I am giving them some benefit of the doubt, since they made local models very accessible to plebs like me; and maybe I can graduate away from ollama.)

It is not true that Ollama doesn't use llama.cpp anymore. They built their own library, which is the default, but it is also really far from being feature-complete. If a model is not supported by their library, they fall back to llama.cpp. For example, there is a group of people trying to get the new IBM models working with Ollama [1]. Their quick, short-term solution is to bump the version of llama.cpp included with Ollama to a newer one that has support, and then at a later time add support in Ollama's own library.

[1] https://github.com/ollama/ollama/issues/10557

> Does ollama support strict structured output or strict tool calls adhering to a json schema?

As far as I understand, this is generally not possible at the model level. The best you can do is wrap the call in a (non-llm) json schema validator and emit an error json in case the llm output does not match the schema, which is what some APIs do for you, but it's not very complicated to do yourself.

Someone correct me if I'm wrong

  • No, that's incorrect - llama.cpp has support for providing a context-free grammar while sampling, and it only samples tokens that conform to the grammar, never tokens that would violate it.
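
    For example, llama-server exposes this through the "grammar" and "json_schema" fields of its /completion endpoint. A rough sketch (assuming a server is already running on localhost:8080; check your llama.cpp version's server README for the exact field names):

        import json
        import urllib.request

        # Constrain sampling to this JSON schema; llama-server can also take
        # a raw GBNF grammar string via the "grammar" field instead.
        body = {
            "prompt": "List one fruit as JSON: ",
            "n_predict": 64,
            "json_schema": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        }
        req = urllib.request.Request(
            "http://127.0.0.1:8080/completion",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            # Output conforms to the schema (barring truncation at n_predict).
            print(json.load(resp)["content"])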

  • This is misinformation. Ollama has supported structured outputs that conform to a given JSON schema for months. Here's a post about this from last year: https://ollama.com/blog/structured-outputs

    This is absolutely possible to do at the model level via logit shaping. Llama-cpp’s functionality for this is called GBNF. It’s tightly integrated into the token sampling infrastructure, and is what ollama builds upon for their json schema functionality.
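
    A minimal sketch against Ollama's /api/chat endpoint, per the blog post above ("llama3.2" is just a placeholder model name):

        import json
        import urllib.request

        # "format" accepts a JSON schema; the model's output is constrained to it.
        body = {
            "model": "llama3.2",
            "messages": [{"role": "user", "content": "Name one fruit."}],
            "format": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
            "stream": False,
        }
        req = urllib.request.Request(
            "http://localhost:11434/api/chat",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(json.load(resp)["message"]["content"])  # schema-conforming JSON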

    • > It’s tightly integrated into the token sampling infrastructure, and is what ollama builds upon for their json schema functionality.

      Do you mean that the functionality of generating an EBNF grammar from a json schema and using it for sampling is part of ggml, and all they have to do is use it?

      I assumed that this was part of llama.cpp, and another feature they have to re-implement and maintain.

> (if there's some benefit I'm missing, please let me know).

Makes their VCs think they're doing more, and have more ownership, rather than being a do-nothing wrapper with some analytics and S3 buckets that rehost models from HF.

> Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket and forwarding requests to it

I'd recommend taking a look at https://github.com/containers/ramalama - it's closer to what you're describing in the way it uses llama-server, and it's also container-native by default, which is nice for portability.