Comment by uyzstvqs
4 days ago
There's also Jan AI, which supports Linux, MCP, any Vulkan GPU, any Llama.cpp-compatible model, and optionally multiple cloud models as well. That seems like a better solution than this.
Choice is good, but here is why I prefer Ollama over others (I'm biased because I work on Ollama).
Supporting multiple backends is HARD. Originally, we thought we'd just add multiple backends to Ollama - MLX, ROCm, TRT-LLM, etc. It sounds really good on paper. In practice, you get into the lowest common denominator effect. What happens when you want to release Model A together with the model creator, and backend B doesn't support it? Do you ship partial support? If you do, then you start breaking your own product experience.
Supporting Vulkan for backwards compatibility on some hardware seems simple, right? What if I told you that in our testing, a portion of the supported hardware matrix sees a 20% drop in performance? What about cherry-picking which hardware uses Vulkan vs ROCm vs CUDA, etc.? Then you're managing a long and tedious support matrix, where each time a driver is updated, the support may shift.
Supporting flash attention sounds simple too, right? What if I told you that on over 20% of the hardware, and for specific models, enabling it causes a non-trivial number of errors tied to specific hardware/model combinations? We are almost at the point where we can selectively enable flash attention per model architecture and hardware architecture.
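To make that concrete, here is a rough sketch of what that kind of gate ends up looking like (Go, since that's what Ollama is written in; the architecture names and verdicts below are hypothetical, invented for illustration, and not our actual support matrix):

```go
// Hypothetical sketch only: the entries below are invented for illustration
// and do not reflect any real support matrix.
package main

import "fmt"

// key identifies one (model architecture, GPU architecture) combination.
type key struct {
	modelArch string // e.g. "llama", "gemma2"
	gpuArch   string // e.g. "sm_86", "gfx1030"
}

// flashAttnOK records combinations that have been validated by hand.
// Every new model family or GPU generation adds rows, and a driver update
// can silently invalidate existing ones.
var flashAttnOK = map[key]bool{
	{"llama", "sm_86"}:   true,
	{"llama", "gfx1030"}: false, // hypothetical: bad output in testing
	{"gemma2", "sm_86"}:  false, // hypothetical: known numerical issue
}

// enableFlashAttention is conservative: any combination that has not been
// explicitly validated falls back to the regular attention path.
func enableFlashAttention(modelArch, gpuArch string) bool {
	return flashAttnOK[key{modelArch, gpuArch}]
}

func main() {
	fmt.Println(enableFlashAttention("llama", "sm_86"))   // true
	fmt.Println(enableFlashAttention("mistral", "sm_86")) // false: not validated yet
}
```

Multiply that table by every backend and every driver version and you get the maintenance burden I'm describing.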
It's easy to add features and hard to say no, but on any given day I will stand for a better overall product experience (at least to me, since it's very subjective). No is temporary and yes is forever.
Ollama focuses on running models the way the model creators intended. I know we get a lot of negativity about naming, but often the names are what we worked out with the model creators (which, surprisingly, may or may not match how another platform named the model at release). Over time, I think this means more focus on top models, so we can optimize them further and add capabilities to augment them.
Sure, those are all difficult problems. But they're problems that solo devs deal with and figure out every day. Why is it so hard for Ollama?
What seems to be true is that Ollama wants to be a solution that drives the narrative and chooses for its users rather than with them. It uses a proprietary model library, it built itself on llama.cpp without upstreaming its changes, it converted standard GGUF model weights into a file format that was unusable outside of Ollama, etc.
Sorry, but I don't buy it. These are not intractable problems. These are excuses from former Docker creators looking to destroy another ecosystem by attempting to co-opt it for their own gain.
^^^ absolutely spot on. There’s a big element of deception going on. I could respect it (and would trust the product more) if they were upfront about their motives and said, “Yes, we are a venture-backed startup and we have profit aspirations, but here’s XYZ thing we can promise.” Instead it’s all smoke and mirrors … super sus.
Started with Ollama, am at the stage of trying llama.cpp and realising its RPC just works, while Ollama's promise of distributed runs is just hanging in the air, so indeed the convenience of Ollama is starting to lose its appeal.
So, questions: what are the changes that they didn't upstream, and is this listed somewhere? What is the impact? Are they also changes in ggml? What was the point of the GGUF format change?
> Supporting multiple backends is HARD. Originally, we thought we'd just add multiple backends to Ollama - MLX, ROCm, TRT-LLM, etc. It sounds really good on paper. In practice, you get into the lowest common denominator effect. What happens when you want to release Model A together with the model creator, and backend B doesn't support it? Do you ship partial support? If you do, then you start breaking your own product experience.
You conceptually divide your product into a "universal experience" and a "conditional experience". You add platform-specific things to the conditional experience while keeping the universal experience unified. I mean, do you even have a choice? The backend limits you; the only alternative is to change the backend upstream, which often amounts to no alternative at all.
The only case where this is a real problem is when the backends are so different that the universal experience is not the main experience. But I don't think this is the case here?
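In code terms, that split is basically capability probing. Here's a minimal sketch (Go, with hypothetical interface names; this is not Ollama's or any project's actual API): every backend must implement the universal path, and conditional features are used only when a backend advertises them.

```go
// Minimal sketch of a "universal vs conditional" split; the interfaces and
// backend here are hypothetical, invented for illustration.
package main

import "fmt"

// Backend is the universal experience: every backend must implement this.
type Backend interface {
	Name() string
	Generate(prompt string) (string, error)
}

// FlashAttention is a conditional, backend-specific capability.
type FlashAttention interface {
	GenerateFast(prompt string) (string, error)
}

// vulkanBackend implements only the universal interface in this sketch.
type vulkanBackend struct{}

func (vulkanBackend) Name() string { return "vulkan" }

func (vulkanBackend) Generate(prompt string) (string, error) {
	return "output for: " + prompt, nil
}

// run prefers the conditional path when the backend advertises it; the
// universal path always works everywhere else.
func run(b Backend, prompt string) (string, error) {
	if fa, ok := b.(FlashAttention); ok {
		return fa.GenerateFast(prompt)
	}
	return b.Generate(prompt)
}

func main() {
	out, err := run(vulkanBackend{}, "hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The product question then becomes only which features are allowed to live on the conditional side.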
GPUStack doesn't seem to have the lowest-common-denominator problem, yet it supports many architectures.
https://github.com/gpustack/gpustack