Comment by jnmandal
8 days ago
I see a lot of hate for ollama doing this kind of thing but also they remain one of the easiest to use solutions for developing and testing against a model locally.
Sure, llama.cpp is the real thing, ollama is a wrapper... I would never want to use something like ollama in a production setting. But if I want to quickly get someone less technical up to speed to develop an LLM-enabled system and run qwen or w/e locally, well then it's pretty nice that they have a GUI and a .dmg to install.
Thanks for the kind words.
Since the new multimodal engine, Ollama has moved off of llama.cpp as a wrapper. We do continue to use the GGML library, and ask hardware partners to help optimize it.
Ollama might look like a toy, and like something trivial to build. I can say that, to keep its simplicity, we go through a great deal of struggle to make it work with the experience we want.
Simplicity is often overlooked, but we want to build the world we want to see.
But Ollama is a toy; it's meaningful for hobbyists and individuals like myself to use locally. Why would it be the right choice for anything more? AWS, vLLM, SGLang, etc. would be the solutions for enterprise.
I knew a startup that deployed ollama on a customer's premises, and when I asked them why, they had absolutely no good reason. Likely they did it because it was easy. That's not the "easy to use" case you want to solve for.
I can say, having tried many inference tools after the launch, that many do not have the models implemented well, especially OpenAI's harmony format.
Why does this matter? For this specific release, we benchmarked against OpenAI's reference implementation to make sure Ollama is on par. We also spent a significant amount of time getting harmony implemented the way it was intended.
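For anyone unfamiliar with harmony: it's a chat wire format with explicit roles and channels. Here's a rough sketch of what a rendered prompt looks like (the token names follow OpenAI's published spec, but the helper, system text, and contents are just illustrative, not our implementation):

    def render_turn(role, content, channel=None):
        # One harmony message: <|start|>role[<|channel|>ch]<|message|>content<|end|>
        header = role if channel is None else f"{role}<|channel|>{channel}"
        return f"<|start|>{header}<|message|>{content}<|end|>"

    prompt = "".join([
        render_turn("system", "Reasoning: medium\n# Valid channels: analysis, commentary, final."),
        render_turn("user", "What is 2 + 2?"),
    ])
    # The model then emits analysis-channel reasoning followed by a
    # final-channel answer that ends with <|return|> instead of <|end|>.
    print(prompt)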
I know vLLM also worked hard to implement against the reference, and they have shared their benchmarks publicly.
Honestly, I think it just depends. A few hours ago I wrote that I would never want it for a production setting, but actually, if I were standing something up myself and could just download headless ollama and know it would work, hey, that would most likely also be fine. Maybe later on I'd revisit it from a devops perspective and refactor the deployment methodology/stack, etc. Maybe I'd benchmark it and realize it's actually fine. Sometimes you just need to make your whole system work.
We can obviously disagree with their priorities, their roadmap, the fact that the client isn't FOSS (I wish it was!), etc., but no one can say that ollama doesn't work. It works. And like mchiang said above: it's dead simple, on purpose.
> Ollama has moved off of llama.cpp as a wrapper. We do continue to use the GGML library
Where can I learn more about this? llama.cpp is an inference application built using the ggml library. Does this mean Ollama now has its own code for what llama.cpp does?
https://github.com/ollama/ollama/tree/main/model/models
This kind of gaslighting is exactly why I stopped using Ollama.
GGML library is llama.cpp. They are one and the same.
Ollama made sense when llama.cpp was hard to use. Ollama does not have a value proposition anymore.
It’s a different repo. https://github.com/ggml-org/ggml
The models are implemented by Ollama https://github.com/ollama/ollama/tree/main/model/models
I can say as a fact that for the gpt-oss model, we also implemented our own MXFP4 kernel and benchmarked it against the reference implementations to make sure Ollama is on par. We implemented harmony and tested it. This should significantly impact tool-calling capability.
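For context, MXFP4 stores weights in blocks of 32 FP4 (E2M1) values that share a single E8M0 scale. A rough Python sketch of dequantizing one block (illustrative only; the nibble ordering is an assumption here, and this is nothing like our actual kernel):

    import numpy as np

    # The 16 E2M1 values, indexed by the 4-bit code (sign in the high bit).
    FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                         -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                        dtype=np.float32)

    def dequant_mxfp4_block(packed16, scale_e8m0):
        # packed16: 16 bytes holding 32 4-bit codes; scale_e8m0: shared exponent byte.
        lo, hi = packed16 & 0x0F, packed16 >> 4
        codes = np.stack([lo, hi], axis=-1).reshape(-1)      # 32 codes
        scale = np.float32(2.0) ** (int(scale_e8m0) - 127)   # E8M0 scale, bias 127
        return FP4_E2M1[codes] * scale

    block = np.arange(16, dtype=np.uint8)
    print(dequant_mxfp4_block(block, 127))                   # scale of 2**0 = 1.0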
I'm not sure if I'm feeding the trolls here. We really love what we do, and I hope it shows in our product, in Ollama's design, and in our voice to our community.
You don’t have to like Ollama. That’s subjective to your taste. As a maintainer, I certainly hope to have you as a user one day. If we don’t meet your needs and you want to use an alternative project, that’s totally cool too. It’s the power of having a choice.
> GGML library is llama.cpp. They are one and the same.
Nope…
> I would never want to use something like ollama in a production setting.
We benchmarked vLLM and Ollama on both startup time and tokens per second. Ollama comes out on top. We hope to be able to publish these results soon.
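If you want to sanity-check numbers like this yourself, any OpenAI-compatible endpoint works for a rough single-request measurement; a minimal sketch (the URL, model name, and prompt are placeholders, not our actual harness):

    import time, requests

    def tokens_per_second(base_url, model, prompt):
        start = time.perf_counter()
        resp = requests.post(f"{base_url}/v1/chat/completions",
                             json={"model": model, "max_tokens": 256,
                                   "messages": [{"role": "user", "content": prompt}]},
                             timeout=300)
        resp.raise_for_status()
        elapsed = time.perf_counter() - start
        return resp.json()["usage"]["completion_tokens"] / elapsed

    # Ollama's OpenAI-compatible API listens on 11434 by default; vLLM on 8000.
    print(tokens_per_second("http://localhost:11434", "gpt-oss:20b",
                            "Explain KV caching in one paragraph."))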
you need to benchmark against llama.cpp as well.
Did you test multi-user cases?
Assuming this is equivalent to parallel sessions, I would hope so; that is basically the entire point of vLLM.
vLLM and Ollama assume different settings and hardware. vLLM, backed by paged attention, expects a lot of requests from multiple users, whereas Ollama is usually for a single user on a local machine.
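A toy sketch of the idea behind paged attention and why it matters for many concurrent users: the KV cache is split into fixed-size blocks handed out from a shared pool, so each request only holds the memory it actually needs (illustrative only, nothing like vLLM's real GPU implementation):

    BLOCK_SIZE = 16  # tokens per KV-cache block

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))
            self.block_tables = {}                      # request id -> physical blocks

        def append_token(self, req_id, position):
            # Return the physical block for this token, allocating only on demand.
            table = self.block_tables.setdefault(req_id, [])
            if position // BLOCK_SIZE >= len(table):
                table.append(self.free_blocks.pop())
            return table[position // BLOCK_SIZE]

        def release(self, req_id):
            # A finished request returns its blocks straight to the shared pool.
            self.free_blocks.extend(self.block_tables.pop(req_id, []))

    cache = PagedKVCache(num_blocks=1024)
    for pos in range(40):                               # two requests interleaved
        cache.append_token("req-a", pos)
        cache.append_token("req-b", pos)
    cache.release("req-a")                              # blocks immediately reusable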
It is weird, but when I tried the new gpt-oss:20b model locally, llama.cpp just failed instantly for me. At the same time, under ollama it worked (very slowly, but anyway). I didn't figure out how to deal with llama.cpp, but ollama is definitely doing something under the hood to make models work.
> I would never want to use something like ollama in a production setting
If you can't get access to "real" datacenter GPUs for any reason and essentially have to do desktop, client-side deploys, it's your best bet.
It's not a common scenario, but a desktop with a 4090 or two is all you can get in some organizations.