Llama.cpp supports Vulkan. why doesn't Ollama?

1 year ago (github.com)

234 comments

buyucu

So many here are trashing on Ollama, saying it's "just" nice porcelain around llama.cpp and it's not doing anything complicated. Okay. Let's stipulate that.

So where's the non-sketchy, non-for-profit equivalent? Where's the nice frontend for llama.cpp that makes it trivial for anyone who wants to play around with local LLMs without having to know much about their internals? If Ollama isn't doing anything difficult, why isn't llama.cpp as easy to use?

Making local LLMs accessible to the masses is an essential job right now—it's important to normalize owning your data as much as it can be normalized. For all of its faults, Ollama does that, and it does it far better than any alternative. Maybe wait to trash it for being "just" a wrapper until someone actually creates a viable alternative.

chown 1 year ago
I totally agree with this. I wanted to make it really easy for non-technical users with an app that hid all the complexities. I basically just wanted to embed the engine without making users open their terminal, let alone make them configure anything. I started with llama.cpp and almost gave up on the idea before I stumbled upon Ollama, which made the app happen[1].
There are many flaws in Ollama, but it makes many things much easier, esp. if you don’t want to bother building and configuring. They do take a long time to merge any PRs though. One of my PRs has been waiting for 8 months, and there was another PR about KV cache quantization that took them 6 months to merge.
[1]: https://msty.app
- zozbot234 1 year ago
  
  > They do take a long time to merge any PRs though.
  I guess you have a point there, seeing as after many months of waiting we finally have a comment on this PR from someone with real involvement in Ollama - see https://github.com/ollama/ollama/pull/5059#issuecomment-2628... . Of course this is very welcome news.
  
- smcleod 1 year ago
  
  That qkv PR was mine! Small world.
washadjeffmad 1 year ago
>So where's the non-sketchy, non-for-profit equivalent
llama.cpp, kobold.cpp, oobabooga, llmstudio, etc. There are dozens at this point.
And while many chalk the attachment to ollama up to a "skill issue", that's just venting frustration that all something has to do to win the popularity contest is to repackage and market it as an "app".
I prefer first-party tools, I'm comfortable managing a build environment and calling models using pytorch, and ollama doesn't really cover my use cases, so I'm not its audience. I still recommend it to people who might want the training wheels while they figure out how not-scary local inference actually is.
- woadwarrior01 1 year ago
  
  > llmstudio
  ICYMI, you might want to read their terms of use:
  https://lmstudio.ai/terms
- evilduck 1 year ago
  
  > llama.cpp, kobold.cpp, oobabooga
  None of these three are remotely as easy to install or use. They could be, but none of them are even trying.
  > lmstudio
  This is a closed source app with a non-free license from a business not making money. Enshittification is just a matter of when.
  
Aurornis 1 year ago
It’s so hard to decipher the complaints about ollama in this comment section. I keep reading comments from people saying they don’t trust it, but then they don’t explain why they don’t trust it and don’t answer any follow up questions.
As someone who doesn’t follow this space, it’s hard to tell if there’s actually something sketchy going on with ollama or if it’s the usual reactionary negativity that happens when a tool comes along and makes someone’s niche hobby easier and more accessible to a broad audience.
- bloomingkales 1 year ago
  
  > they don’t explain why they don’t trust it
  We need to know a few things:
  1) Show me the lines of code that log things and how it handles temp files and storage.
  2) No remote calls at all.
  3) No telemetry at all.
  This is the feature list I would want to begin trusting. I use this stuff, but I also don’t trust it.
  
traverseda 1 year ago
>So where's the non-sketchy, non-for-profit equivalent?
Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.
That, coupled with the increasing scale of the internet, makes it harder and harder for smaller groups to do these kinds of things. At least until we get some good content-addressed distributed storage system.
- woadwarrior01 1 year ago
  
  > Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.
  Cloudflare R2 has unlimited egress, and AFAIK, that's what ollama uses for hosting quantized model weights.
buyucu 1 year ago
supporting vulkan will help ollama reach the masses who don't have dedicated gpus from nvidia.
this is such a low hanging fruit that it's silly how they are acting.
- lolinder 1 year ago
  
  As has been pointed out in this thread in a comment that you replied to (so I know you saw it) [0], Ollama goes through a lot of contortions to support multiple llama.cpp backends. Yes, their solution is a bit of a hack, but it means that the effort of adding a new backend is substantial.
  And again, they're doing those contortions to make it easy for people. Making it easy involves trade-offs.
  Yes, Ollama has flaws. They could communicate better about why they're ignoring PRs. All I'm saying is let's not pretend they're not doing anything complicated or difficult when no one has been able to recreate what they're doing.
  [0] https://news.ycombinator.com/item?id=42886933
  
bestcoder69 1 year ago
Llamafile: https://github.com/Mozilla-Ocho/llamafile
- lolinder 1 year ago
  
  Llamafile is great but solves a slightly different problem very well: how do I easily download and run a single model without having any infrastructure in place first?
  Ollama solves the problem of how I run many models without having to deal with many instances of infrastructure.
  
- homebrewer 1 year ago
  
  It's actually more difficult to use on linux (compared to ollama) because of the weird binfmt contortions you have to go through.
  
axegon_ 1 year ago
I think you are missing the point. To get things straight: llama.cpp is not hard to setup and get running. It was a bit of a hassle in 2023 but even then it was not catastrophically complicated if you were willing to read the errors you were getting. People are dissatisfied for two very valid reasons. The first is that ollama gives little to no credit to llama.cpp. The second is the point of the post: a PR, and not a huge one at that, has been open for over 6 months and has been completely ignored. Perhaps the ollama maintainers personally don't have a use for it so they shrugged it off, but this is the equivalent of "it works on my computer". Imagine if all kernel devs used Intel CPUs and ignored every non-Intel CPU-related PR. I am not saying that the kernel mailing list is not a large scale version of a countryside pub on a Friday night - it is. But the maintainers do acknowledge the efforts of people making PRs and do a decent job at addressing them. While small, the PR here is not trivial and should have been, at the very least, discussed. Yes, the workstation/server I use for running models uses two Nvidia GPUs. But my desktop computer uses an Intel Arc and in some scenarios, hypothetically, this PR might have been useful.
- lolinder 1 year ago
  
  > To get things straight: llama.cpp is not hard to setup and get running. It was a bit of a hassle in 2023 but even then it was not catastrophically complicated if you were willing to read the errors you were getting.
  It's made a lot of progress in that the README [0] now at least has instructions for how to download pre-built releases or docker images, but that requires actually reading the section entitled "Building the Project" to realize that it provides more than just building instructions. That is not accessible to the masses, and it's hard for me to not see that placement and prioritization as an intentional choice to be inaccessible (which is a perfectly valid choice for them!)
  And that's aside from the fact that Ollama provides a ton of convenience features that are simply missing, starting with the fact that it looks like with llama.cpp I still have to pick a model at startup time, which means switching models requires SSHing into my server and restarting it.
  None of this is meant to disparage llama.cpp: what they're doing is great and they have chosen to not prioritize user convenience as their primary goal. That's a perfectly valid choice. And I'm also not defending Ollama's lack of acknowledgment. I'm responding to a very specific set of ideas that have been prevalent in this thread: that not only does Ollama not give credit, they're not even really doing very much "real work". To me that is patently nonsense—the last mile to package something in a way that is user friendly is often at least as much work, it's just not the kind of work that hackers who hang out on forums like this appreciate.
  [0] https://github.com/ggerganov/llama.cpp
- portaouflop 1 year ago
  
  llama.cpp is hard to set up - I develop software for a living and it wasn’t trivial for me. ollama I can give to my non-technical family members and they know how to use it.
  As for not merging the PR - why are you entitled to have a PR merged? This attitude of entitlement around contributions is very disheartening as an OSS maintainer - it’s usually more work to review/merge/maintain a feature etc. than to open a PR. Also no one is entitled to comments / discussion or literally one second of my time as an OSS maintainer. This is imo the cancer that is eating open source.
  
pepijndevos 1 year ago

ramalama seems to be trying, it's a docker based approach.
airstrike 1 year ago
Here you go: https://github.com/hecrj/icebreaker
- lolinder 1 year ago
  
  > No pre-built binaries yet! Use cargo to try it out
  Not an equivalent yet, sorry.

buyucu 1 year ago

llama.cpp has supported vulkan for more than a year now. For more than 6 months now there has been an open PR to add vulkan backend support for Ollama. However, the Ollama team has not even looked at it or commented on it.

Vulkan backends are existential for running LLMs on consumer hardware (iGPUs especially). It's sad to see Ollama miss this opportunity.

Kubuxu 1 year ago
Don’t be sad for a commercial entity that is not a good player: https://github.com/ggerganov/llama.cpp/pull/11016#issuecomme...
- andy_ppp 1 year ago
  
  This is great, I did not know about RamaLama. I'll be using and recommending it in future, and if I see people using Ollama in instructions I'll recommend they move to RamaLama. Cheers.
  
- bearjaws 1 year ago
  
  It's hilarious that docker guys are trying to take another OSS and monetize it. Hey if it worked once?...
- buyucu 1 year ago
  
  I was not aware of this context, thanks!
n144q 1 year ago

Thanks, just yesterday I discovered that Ollama could not use iGPU on my AMD machine, and was going through a long issue for solutions/workarounds (https://github.com/ollama/ollama/issues/2637). Existing instructions are based on Linux, and some people found it utterly surprising that anyone wants to run LLMs on Windows (really?). While I would have no trouble installing Linux and compiling from source, I wasn't ready to do that to my main, daily-use computer.
Great to see this.
PS. Have you got feedback on whether this works on Windows? If not, I can try to create a build today.
zozbot234 1 year ago
The PR has been legitimately out-of-date and unmergeable for many months. It was forward-ported a few weeks ago, and is now still awaiting formal review and merging. (To be sure, Vulkan support in Ollama will likely stay experimental for some time even if the existing PR is merged, and many setups will need manual adjustment of the number of GPU layers and such. It's far from 100% foolproof even in the best-case scenario!)
For that matter, some people are still having issues building and running it, as seen from the latest comments on the linked GitHub page. It's not clear that it's even in a fully reviewable state just yet.
- buyucu 1 year ago
  
  this pr was reviewable multiple times, rebased multiple times. all because ollama team kept ignoring it. it has been open for almost 7 months now without a single comment from the ollama folks.
- ecurtin 1 year ago
  
  It gets out of date with conflicts, etc., because it's ignored. If this were Ollama's upstream project, llama.cpp, the maintainers would have got this merged months ago.
9cb14c1ec0 1 year ago
The PR at issue here blocks iGPUs. My fork of the PR removes that restriction:
https://github.com/9cb14c1ec0/ollama-vulkan
I successfully ran Phi4 on my AMD Ryzen 7 PRO 5850U iGPU with it.
- buyucu 1 year ago
  
  this is great! I think pufferfish is taking PRs to his fork as well.

Havoc 1 year ago

ollama was good initially in that it made LLMs more accessible for non-technical people while everyone was figuring things out.

Lately they seem to be contributing mostly confusion to the conversation.

The #1 model the entire world is talking about is literally mislabeled on their side. There is no such thing as R1-1.5b. Quantization without telling users also confuses noobs as to what is possible. Setting up an API different from the thing they're wrapping adds chaos. And claiming each feature added to llama.cpp as something "ollama now supports" is exceedingly questionable, especially when combined with the very sparse acknowledgement that it's a wrapper at all.

Whole thing just doesn't have good vibes

dingocat 1 year ago
What do you mean there is no such thing as R1-1.5b? DeepSeek released a distilled version based on a 1.5B Qwen model with the full name DeepSeek-R1-Distill-Qwen-1.5B, see chapter 3.2 on page 14 of their research article [0].
[0] https://arxiv.org/abs/2501.12948
- trissi1996 1 year ago
  
  Which is not the same model, it's not R1 it's R1-Distill-Qwen-1.5B....
  
- Havoc 1 year ago
  
  ollama labels the qwen models R1, while the "R1" moniker standing on its own in deepseek world means the full model that has nothing to do with qwen.
  https://ollama.com/library/deepseek-r1
  That may have been ok if it was just same model at different sizes but they're completely different things here & it's created confusion out of thin air for absolutely no reason other than ollama being careless.
  

the_mitsuhiko 1 year ago

Ollama needs competition. I’m not sure what drives the people that maintain it but some of their actions imply that there are ulterior motives at play that do not have the benefit of their users in mind.

However such projects require a lot of time and effort and it’s not clear if this project can be forked and kept alive.

Deathmax 1 year ago
The most recent one off the top of my head is their horrendous aliasing of DeepSeek R1 on their model hub, misleading users into thinking they are running the full model when really anything but the 671b alias is one of the distilled models. This has already led to lots of people claiming that they are running R1 locally when they are not.
- TeMPOraL 1 year ago
  
  The whole DeepSeek-R1 situation gets extra confusing because:
  - The distilled models are also provided by DeepSeek;
  - There's also dynamic quants of (non-distilled) R1 - see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get as low as ~140GB file size with the 1.58-bit quant.
  I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.
  --
  [0] - https://unsloth.ai/blog/deepseekr1-dynamic
  
- adastra22 1 year ago
  
  I'm not sure that's fair, given that the distilled models are almost as good. Do you really think Deepseek's web interface is giving you access to 671b? They're going to be running distilled models there too.
  
blixt 1 year ago
LM Studio has been around for a long time and does a lot of similar things but with a more UI-based approach. I used to use it before Ollama, and it seems it's still going strong. https://lmstudio.ai/
- buyucu 1 year ago
  
  isn't LM Studio closed source?
7thpower 1 year ago
Can you please explain why you think they may be operating in bad faith?
- diggan 1 year ago
  
  Not parent, but same feeling.
  First I got the feeling because of how they store things on disk and try to get all models rehosted in their own closed library.
  Second time I got the feeling is when it's not obvious at all about what their motives are, and that it's a for-profit venture.
  Third time is trying to discuss things in their Discord, where the moderators constantly shut down a lot of conversation citing "Misinformation" and rewrite your messages. You can ask an honest question, it gets deleted and you get blocked for a day.
  Just today I asked why the R1 models they're shipping, which are the distilled ones, don't have "distilled" in the name, or even any way of knowing which tag is which model, and got the answer "if you don't like how things are done on Ollama, you can run your own object registry" which doesn't exactly inspire confidence.
  Another thing I noticed after a while is that there are a bunch of people with zero knowledge of terminals that want to run Ollama, even though Ollama is a project for developers (since you do need to know how to run a terminal). Just making the messaging clearer would help a lot in this regard, but somehow the Ollama team thinks that's gatekeeping and it's better to teach people basic terminal operations.
  
justinmayer 1 year ago

Benefiting users is definitely not Ollama’s first priority, as seen when this pull request was summarily closed: https://github.com/jmorganca/ollama/pull/395
Those README changes only served to provide greater transparency to would-be users.
Ulterior motives, indeed.
prabir 1 year ago
There is https://cortex.so/ that I’m looking forward to.
- adastra22 1 year ago
  
  Hey thanks, I didn't know about cortex and this looks perfect.
imtringued 1 year ago

Ollama doesn't really need competition. Llama.cpp just needs a few usability updates to the gguf format so that you can specify a hugging face repository like you can do in vLLM already.
buyucu 1 year ago

I totally agree that ollama needs competition. They have been doing very sketchy things lately. I wish llama.cpp had an alternative wrapper client like ollama.
Liquix 1 year ago

agreed. but what's wrong with Jan? does ollama utilize resources/run models more efficiently under the hood? (sorry for the naivete)

benxh 1 year ago

My biggest gripe with Ollama is the badly named models, e.g. under deepseek-r1, it defaults to the distill models.

buyucu 1 year ago
I agree they should rename them.
But defaulting to a 671b model is also evil.
- rfoo 1 year ago
  
  No. If you can't run it and most people can never run the model on their laptop, it's fine, let people know the fact, instead of giving them an illusion.
  

trash_cat 1 year ago

I use Ollama because I am a casual user and can't be bothered to read the docs on how to setup llama.cpp. I just want to run a simple llm locally.

Why would I care about Vulkan?

buyucu 1 year ago
with vulkan it runs much much faster on consumer hardware, especially on igpus like intel or amd.
- zozbot234 1 year ago
  
  Well, it definitely runs faster on external dGPU's. With iGPU's and possibly future NPU's, the pre-processing/"thinking" phase is much faster (because that one is compute-bound) but text generation tends to be faster on CPU because it makes better use of available memory bandwidth (which is the relevant constraint there). iGPU's and NPU's will still be a win wrt. energy use, however.
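The memory-bandwidth point above can be made concrete with a back-of-the-envelope estimate. This is only a rough sketch: it assumes every generated token has to stream all model weights from memory once (a common rule of thumb, not a measurement), and the model size and bandwidth figures are hypothetical values picked for illustration.

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed if each token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical figures, chosen only to show the shape of the estimate.
MODEL_SIZE_GB = 8.0           # a quantized model occupying 8 GB
SHARED_MEMORY_GB_S = 50.0     # assumed system RAM bandwidth (shared by CPU and iGPU)
DGPU_VRAM_GB_S = 500.0        # assumed dedicated-GPU memory bandwidth

shared_memory_bound = max_tokens_per_second(MODEL_SIZE_GB, SHARED_MEMORY_GB_S)
dgpu_bound = max_tokens_per_second(MODEL_SIZE_GB, DGPU_VRAM_GB_S)

print(f"shared system memory: <= {shared_memory_bound:.2f} tokens/s")
print(f"dedicated GPU VRAM:   <= {dgpu_bound:.2f} tokens/s")
```

Under this assumption a CPU and an iGPU that share the same system memory also share the same ceiling for text generation, so any iGPU advantage would have to come from the compute-bound prompt-processing phase or from energy use.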
- bdhcuidbebe 1 year ago
  
  For Intel, OpenVINO should be the preferred route. I don't follow AMD, but Vulkan is just the common denominator here.
  
- sebazzz 1 year ago
  
  How is the performance of Vulkan vs ROCm on AMD iGPUs? Ollama can be persuaded to run on iGPUs with ROCm.

a12k 1 year ago

Ollama is sketchy enough that I run it in a VM. Which is odd because it would probably take less effort to just run Llama.cpp directly, but VMs are pretty easy so just went that route.

When I see people bring up the sketchiness most of the time the creator responds with the equivalent of shrugs, which imo increases the sketchiness.

nialv7 1 year ago
It's fully open source. I mean yes it uses llama.cpp without giving it credit. But why run it in a VM?
- a12k 1 year ago
  
  It severely over-permissions itself on my Mac.
  
- instagary 1 year ago
  
  Isn't there a clause in MIT that says you're required to give credit? Also, I didn't know it was a YC company that started it: https://www.ycombinator.com/companies/ollama.
  
- krowek 1 year ago
  
  > But why run it in a VM?
  Because you don't execute untrusted code on your machine without containerization/virtualization. Do you?
  
n144q 1 year ago

Care to elaborate what "sketchy" refers to here?
nicce 1 year ago
> but VMs are pretty easy so just went that route.
Don’t you need at least 2 GPUs in that case and set up kernel-level passthrough?
- a12k 1 year ago
  
  I don’t use GPU. Works fine, but the large Mixtral models are slow.
- bdhcuidbebe 1 year ago
  
  i pass through my dGPU to VM and use iGPU for desktop
buyucu 1 year ago
ollama advertising llama.cpp features as their own is very dishonest in my opinion.
- portaouflop 1 year ago
  
  That’s the curse and blessing of open source I guess? I have billion dollar companies running my oss software without giving me anything - but do I gripe about it in public forums? Yea maybe sometimes but it never helps to improve the situation.
  
- adastra22 1 year ago
  
  Welcome to open source.

mschwaig 1 year ago

Ollama tries to appeal to a lowest common denominator user base, who does not want to worry about stuff like configuration and quants, or which binary to download.

I think they want their project to be smart enough to just 'figure out what to do' on behalf of the user.

That appeals to a lot of people, but I think them stuffing all backends into one binary and auto-detecting at runtime which to use is actually a step too far towards simplicity.

What they did to support both CUDA and ROCm using the same binary looked quite cursed last time I checked (because they needed to link or invoke two different builds of llama.cpp of course).

I have only glanced at that PR, but I'm guessing that this plays a role in how many backends they can reasonably try to support.

In nixpkgs it's a huge pain that we configure quite deliberately what we want Ollama to do at build time, and then Ollama runs off and does whatever anyways, and users have to look at log output and performance regressions to know what it's actually doing, every time they update their heuristics for detecting ROCm. It's brittle as hell.

buyucu 1 year ago

I disagree with this, but it's a reasonable argument. The problem is that the Ollama team has basically ignored the PR, instead of engaging the community. The least they can do is to explain their reasoning.
This PR is #1 on their repo based on multiple metrics (comments, iterations, what have you)

your_challenger 1 year ago

I don't know why one would use Ollama instead of llama.cpp. llama.cpp is so easy to use and the maintainer is pretty famous and active in the community.

buyucu 1 year ago
Llama.cpp dropped support for multimodal vlms. That is why I am using ollama. I would happily switch back if I could.
- Gracana 1 year ago
  
  llama.cpp readme still lists multimodal models.. Qwen2-VL and others. Is that inaccurate, or something different?
  [edit] Oh I see, here's an issue about it: https://github.com/ggerganov/llama.cpp/issues/8010
  

av_conk 1 year ago

I tried using ollama because I couldn't get ROCm working on my system with llama-cpp. Ollama bundles the ROCm libraries for you. I got around 50 tokens per second with that setup.

I tried llama-cpp with the Vulkan backend and doubled the amount of tokens per second. I was under the impression ROCm is superior to Vulkan, so I was confused about the result.

In any case, I've stuck with llama-cpp.

buyucu 1 year ago

It depends on your GPU. Vulkan is well-supported by essentially all GPUs. AMD supports ROCm well for their datacenter GPUs, but support for consumer hardware has not been as good.

paradite 1 year ago

Could it be that supporting multiple platforms opens up more support tickets and adds more work to keep the software working on those new platforms?

As someone who built apps for Windows, Linux, macOS, iOS and Android, it is not trivial to ensure your new features or updates work on all platforms, and you have to deal with deprecations.

geerlingguy 1 year ago

They already support ROCm, which probably introduces 10x more support requests than Vulkan would!
buyucu 1 year ago
ollama is not doing anything. llama cpp does all that work. ollama is just a small wrapper on top.
- zozbot234 1 year ago
  
  This is not quite correct. Ollama must assess the state of Vulkan support and amount of available memory, then pick the fraction of the model to be hosted on GPU. This is not totally foolproof and will likely always need manual adjustment in some cases.
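To illustrate the kind of decision described above, here is a minimal sketch of picking how many layers to offload from the amount of free GPU memory. It is not Ollama's actual logic: the function name, the fixed per-layer size, and the reserve are all hypothetical simplifications (real layers differ in size, and context buffers also need memory).

```python
def pick_gpu_layers(total_layers: int, layer_size_mb: float,
                    free_vram_mb: float, reserve_mb: float = 512.0) -> int:
    """Return how many layers fit in free VRAM after holding back a reserve."""
    if total_layers <= 0 or layer_size_mb <= 0:
        raise ValueError("total_layers and layer_size_mb must be positive")
    # Keep some memory back for scratch buffers and the desktop (assumed value).
    usable_mb = max(0.0, free_vram_mb - reserve_mb)
    fits = int(usable_mb // layer_size_mb)
    # Never report more layers than the model actually has.
    return min(total_layers, fits)


# Hypothetical model: 32 layers of about 250 MB each, 4 GB of free VRAM.
layers_on_gpu = pick_gpu_layers(total_layers=32, layer_size_mb=250.0,
                                free_vram_mb=4096.0)
print(f"offload {layers_on_gpu} of 32 layers")
```

A value like this is what would be handed to llama.cpp's GPU layer count option (`--n-gpu-layers`). An estimate this crude can easily be off in either direction, which fits the point that manual adjustment will sometimes be needed.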
  
- paradite 1 year ago
  
  Ok assuming what you said is correct, why wouldn't Ollama then be able to support Vulkan by default out of the box?
  Sorry I'm not sure what the relationship between the two projects is exactly. This is a genuine question, not a troll question.
  

turnsout 1 year ago

This is going to sound like a troll, but it's an honest question: Why do people use Ollama over llama.cpp? llama.cpp has added a ton of features, is about as user-friendly as Ollama, and is higher-performance. Is there some key differentiator for Ollama that I'm missing?

SkyPuncher 1 year ago
Ollama - `brew install ollama`
llama.cpp - Read the docs, with loads of information and unclear use cases. Question if it has API compatibility and secondary features that a bunch of tools expect. Decide it's not worth your effort when `ollama` is already running by the time you've read the docs
- kgwgk 1 year ago
  
  https://formulae.brew.sh/formula/llama.cpp
- LorenDB 1 year ago
  
  Additionally, Ollama makes model installation a single command. With llama.cpp, you have to download the raw models from Huggingface and handle storage for them yourself.
  
- singularity2001 1 year ago
  
  ollama run deepseek-r1:14b
portaouflop 1 year ago
I can only speak for myself, but to me llama.cpp looks kind of hard to use (tbh never tried to use it), whereas ollama was just one cli command away. Also I had no idea that it's equivalent; I thought llama.cpp was some experimental tool for hardcore llm cracks, not something that I can teach, for example, my non-technical mom to use.
Looking at the repo of llama.cpp it’s still not obvious to me how to use it without digging in - I need to download models from huggingface it seems and configure stuff etc - with ollama I type ollama get or something and it works.
Tbh I don’t use that stuff a lot or even seriously, maybe once per month to try out new local models.
I think having an easy to use quickstart would go a long way for llama.cpp - but maybe it’s not intended for casual (stupid?) users like me…
- Majromax 1 year ago
  
  In my mind, it doesn't help that llama.cpp's name is that of a source file. Intuitively, that name screams "library for further integration," not "tool for end-user use."
n144q 1 year ago

https://news.ycombinator.com/item?id=40693391
(I recommend doing a search yourself first)
Basically, if you know how to use a computer, you can use Ollama (almost). You can't say the same thing about llama.cpp. Not everyone knows how to build from source, or even what "build" means.
paradite 1 year ago
For starters:
- It doesn't have a website
- It doesn't have a download page, you have to build it yourself
- woadwarrior01 1 year ago
  
  > - It doesn't have a download page, you have to build it yourself
  I'd wager that anyone capable enough to run a command line tool like Ollama should also be able to download prebuilt binaries from the llama.cpp releases page[1]. Also, prebuilt binaries are available on things like homebrew[2].
  [1]: https://github.com/ggerganov/llama.cpp/releases
  [2]: https://formulae.brew.sh/formula/llama.cpp
  
mrkeen 1 year ago
I used both. I had a terrible time with llama, and did not realise it until I used ollama.
I owned an RTX2070, and followed the llama instructions to make sure it was compiling with GPU enabled. I then hand-tweaked settings (numgpulayers) to try to make it offload as much as possible to the GPU. I verified that it was using a good chunk of my GPU ram (via nvidia-smi), and confirmed that with-gpu was faster than cpu-only. It was still pretty slow, and influenced my decision to upgrade to an RTX3070. It was faster, but still pretty meh...
The first time I used ollama, everything just worked straight out of the box, with one command and zero configuration. It was lightning fast. Honestly if I'd had ollama earlier, I probably wouldn't have felt the need to upgrade GPU.
- serial_dev 1 year ago
  
  Maybe it was lightning fast because the model names are misleading? I installed it to try out deepseek, and I was surprised how small the download artifact was and how easily it ran on my simple 3-year-old Mac. I was a bit disappointed as deepseek gave bad responses and I heard it should be better than what I used on OpenAI… only to then realize after reading it on Twitter that I got a very small version of deepseek r1.
  Maybe you were running a different model?
- bildung 1 year ago
  
  If it was faster with ollama, then you most probably just downloaded a different model (hard to recognize with ollama). Ollama only adds UX to llama.cpp, and nothing compute-wise.
stuaxo 1 year ago

The server in llama-cpp is documented as being only for demonstration, but ollama supports it as a model to run it.
For work, we are given Macs and so the GPU can't be passed through to docker.
I wanted a client/server where the server has the LLM and runs outside of Docker, but without me having to write the client/server part.
I run my model in ollama, then inside the code use litellm to speak to it during local development.
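A sketch of what the client side of that setup can look like. To stay dependency-free it uses only the Python standard library rather than litellm, and talks to Ollama's local REST endpoint (`/api/generate` on port 11434 by default). The model name is a placeholder, and the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model: str, prompt: str,
                  url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example call, left commented out because it needs a running Ollama server
# and a model that has already been pulled ("llama3" is only a placeholder):
# print(generate("llama3", "Say hello in one sentence."))
```

Because the server runs on the host and the code only needs an HTTP URL, the client can live inside a container while the model keeps access to the Mac's GPU, which is the split described above.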
rakatata 1 year ago

While not rocket science, a lot of its features require knowing how to recompile the project while passing certain variables. Also you need to properly format prompts for each instruct model.
buyucu 1 year ago

I use ollama because llama.cpp dropped support for vlms. I would happily switch back if llama.cpp starts supporting vlms again.
dinosaurdynasty 1 year ago

Can you even use bare llama.cpp with OpenWebUI? Especially when they are running on two different computers?
zophiana 1 year ago
Honestly I just didn't know it was this easy to use, maybe because of the name... But ramalama seems to be a full replacement for ollama
- himhckr 1 year ago
  
  ramalama still needs users to be able to install docker first, no? That’s a barrier to entry for many users esp. Windows where I have had my struggles running Docker not to mention a massive resource hog.
  

wkat4242 1 year ago

That's a weird thing about Ollama yes.

It took very long for them to support KV cache quantisation too (which drastically reduces the amount of VRAM needed for context!). Even though the underlying llama.cpp had offered it for ages. And they had it handed to them on a platter, someone had developed everything and submitted a patch.
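To give a feel for why KV cache quantisation matters for VRAM, here is a rough worked estimate. The model shape is hypothetical, and the bytes-per-element figures assume ggml-style block formats (a 2-byte scale per block of 32 values); treat both as assumptions rather than exact numbers for any particular model or build.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: float) -> float:
    """Size of the KV cache: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


# Assumed storage cost per cached value for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,   # 32 one-byte values + 2-byte scale per block
    "q4_0": 18 / 32,   # 32 half-byte values + 2-byte scale per block
}

# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 32k context.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)

sizes_gib = {
    name: kv_cache_bytes(bytes_per_element=bpe, **SHAPE) / 1024**3
    for name, bpe in BYTES_PER_ELEMENT.items()
}
for name, gib in sizes_gib.items():
    print(f"{name:5s} {gib:.3f} GiB")
```

With these assumed numbers the cache shrinks from 4 GiB at f16 to a little over 2 GiB at q8_0, which on a low-VRAM card can decide whether a long context fits at all.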

The developer of that patch was even about to give up, as he had to constantly keep it up to date with upstream even though he was constantly being ignored. So he had no idea if it would ever be merged.

They just seem to be really hesitant to offer new features.

Eventually it was merged and it made a huge difference to people with low VRAM cards.
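
To put rough numbers on the VRAM savings, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 8192-token context) is an assumption chosen to resemble an 8B-class model with grouped-query attention, not something taken from this thread. The per-element sizes follow the block layouts of the `q8_0` and `q4_0` formats (32 values plus a 2-byte scale per block).

```python
# Bytes per element: f16 is 2 bytes; q8_0 and q4_0 store blocks of 32 values
# plus a 2-byte scale, i.e. 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return int(elements * BYTES_PER_ELEMENT[cache_type])


# Assumed shape, roughly an 8B-class model with grouped-query attention.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
for cache_type in BYTES_PER_ELEMENT:
    size_gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
    print(f"{cache_type}: {size_gib:.2f} GiB")
```

Under these assumptions the cache drops from 1.00 GiB at f16 to about 0.53 GiB at q8_0 and 0.28 GiB at q4_0, and the gap grows linearly with context length.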

quibono 1 year ago

Is Ollama just the porcelain around llama.cpp? Or is there more to it than that?

buyucu 1 year ago

yes, it's a convenience wrapper around llama.cpp
diggan 1 year ago

They also decided to rehost the model files in their own (closed) library/repository and to store the files split into layers on disk, so you cannot easily reuse model files between applications. I think the point is that models can share layers, but I'm not sure how much space you actually save. I just know that if you use both LM Studio + Ollama you cannot share models, but if you use LM Studio + llama.cpp you can share the same files between them, with no need to download duplicate model weights.
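As an illustration of what "split into layers" can mean in practice, here is a hedged sketch that looks up the weights file for a model from its manifest. The layout is an assumption, not confirmed in this thread: a JSON manifest with a `layers` list of `mediaType`/`digest` entries, blobs stored as `blobs/sha256-<hex>` under the models directory, and the media type string shown below. The function name `find_weights_blob` is hypothetical.

```python
import json
from pathlib import Path

# Assumed media type marking the layer that holds the model weights.
MODEL_LAYER_TYPE = "application/vnd.ollama.image.model"


def find_weights_blob(models_dir, manifest_path):
    """Return the blob path for the weights layer named in a manifest.

    Assumes a manifest with a "layers" list of {"mediaType", "digest"} entries
    and blobs stored as <models_dir>/blobs/sha256-<hex>.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == MODEL_LAYER_TYPE:
            # A digest like "sha256:abc..." maps to the file name "sha256-abc...".
            return Path(models_dir) / "blobs" / layer["digest"].replace(":", "-")
    return None
```

If the layout matches, the returned path could be symlinked under a friendlier name for other tools to load, though that depends on the blob being a plain weights file.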
ac29 1 year ago

The main feature IMO is the model library. llama.cpp on its own does not come with any built in way to download and manage models.

colorant 1 year ago

There is ipex-llm support for Ollama on Intel GPU (https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...)

arvigeus 1 year ago

Finally a response from the maintainers: https://github.com/ollama/ollama/pull/5059#issuecomment-2628...

I am the least qualified person to comment on that, but honestly their response made me raise an eyebrow.

cedws 1 year ago

The way Ollama has basically been laundering llama.cpp’s features as its own felt dodgy; this appears to confirm there’s something underhanded going on.

buyucu 1 year ago
I did not assume the worst when submitting the post, but that is also my suspicion. The whole thing is very dodgy.
- bloomingkales 1 year ago
  
  Are closer-to-the-metal AI developers an under-tracked bottleneck? AMD and Intel can barely get off the ground due to lagging software developers.
  
davely 1 year ago

I think it's important to bring up the fact that llama.cpp has an MIT license[0]. Notably, the MIT license "permits reuse within proprietary software, provided that all copies of the software or its substantial portions include a copy of the terms of the MIT License and also a copyright notice.[1]"
You'll find that Ollama is also distributed under an MIT license[2]. It's fine to disagree with their priorities and lack of transparency. But trying to argue how they use code from other repositories that permit such a thing is tilting at windmills, IMHO.
[0] https://github.com/ggerganov/llama.cpp/blob/master/LICENSE
[1] https://en.wikipedia.org/wiki/MIT_License
[2] https://github.com/ollama/ollama/blob/main/LICENSE
moffkalast 1 year ago
Ollama is a private for-profit company, of course there's something shady going on.
- ethbr1 1 year ago
  
  Ollama is a private for-profit AI company, of course there's something shady going on.
  Because apparently you can take unethical business practices, add AI, and suddenly it's a whole new thing that no one can judge!
  
andy_ppp 1 year ago
It would be extremely unsurprising if Nvidia was funding this embrace and extend behind the scenes.
- parineum 1 year ago
  
  It would be pretty surprising to their shareholders if Nvidia was hiding where it was spending its money.

2-3-7-43-1807 1 year ago

can someone please give a quick summary of the criticism towards ollama?

as far as my intel goes it's a mozilla project shouldered mostly by one 10x programmer. i found ollama through hn and last time i didn't notice any lack of trust or suspected sketchiness ... so what changed?

denverllc 1 year ago

IMO ggerganov is a 10x programmer in the same way Fabrice Bellard is: doing the actual hard infrastructure work that most developers would not be able to do in a reasonable amount of time and at a high performance.
In contrast, the ollama dev team is doing useful work (creating an easy interface) but otherwise mostly piggybacking off the already existing infrastructure
woadwarrior01 1 year ago
> as far as my intel goes it's a mozilla project shouldered mostly by one 10x programmer.
That's completely off the mark.
https://www.ycombinator.com/companies/ollama
- 2-3-7-43-1807 1 year ago
  
  seems like i confused llamafile with ollama ... this whole llm biotope is a huge mess
buyucu 1 year ago

ollama has been advertising llama.cpp features as their own, which I find very dishonest.

llm_trw 1 year ago

Can someone explain what the point of ollama is?

Every time I look at it, it seems like it's a worse llama.cpp that removes options to make things "easier".

michaelt 1 year ago

Open-weights LLMs provide a dizzying array of options.
You'd have Llama, Mistral, Gemma, Phi, Yi.
You'd have Llama, Llama 2, Llama 3, Llama 3.2...
And those are offered with 8B, 13B or 70B parameters.
And you can get it quantised to GGUF, AWQ, exl2...
And quantised to 2, 3, 4, 6 or 8 bits.
And that 4-bit quant is available as Q4_0, Q4_K_S, Q4_K_M...
And on top of that there are a load of fine-tunes that score better on some benchmarks.
Sometimes a model is split into 30 files and you need all 30, other times there's 15 different quants in the same release and you only need a single one. And you have to download from huggingface and put the files in the right place yourself.
ollama takes a lot of that complexity and hides it. You run "ollama run llama3.1" and the selection and download all gets taken care of.
danielbln 1 year ago

Not to be snide, but removing options to make things easier has been wildly successful in a variety of projects/products.
pornel 1 year ago
Ollama : llama.cpp :: Dropbox : rsync
- diggan 1 year ago
  
  Not sure this is a good analogy. LM Studio is closer to Dropbox, as both take X and make it easier for users who aren't necessarily very technical. Ollama is a developer-oriented tool (used via terminal + a daemon), so I wouldn't compare it to what Dropbox is/did for file syncing.
portaouflop 1 year ago

It’s to make things easier for casual users.
With ollama I type brew install ollama and then ollama run something, and I have it already running. With llama.cpp it seems I have to build it first, then manually download models somewhere. This is an instant turnoff; I maybe have 5 minutes of my life to waste on this.
ianpurton 1 year ago

It's very easy to install and add models.
baq 1 year ago

yeah that's literally the point. you're listing something that you think is a disadvantage and some people think exactly the opposite.