Comment by anakaine

2 months ago

Llama.cpp now has a gui installed by default. It previously lacked this. Times have changed.

27 comments

anakaine

Having read above article, I just gave llama.cpp a shot. It is as easy as the author says now, though definitely not documented quite as well. My quickstart:

brew install llama.cpp

llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000

Go to localhost:8000 for the Web UI. On Linux it accelerates correctly on my AMD GPU, which Ollama failed to do, though of course everyone's mileage seems to vary on this.

teekert 2 months ago
Was hoping it was so easy :) But I probably need to look into it some more.
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4' llama_model_load_from_file_impl: failed to load model
Edit: @below, I used `nix-shell -p llama-cpp` so not brew related. Could indeed be an older version indeed! I'll check.
- adrian_b 2 months ago
  
  As it has been discussed in a few recent threads on HN, whenever a new model is released, running it successfully may need changes in the inference backends, such as llama.cpp.
  There are 2 main reasons. One is the tokenizer, where new tokenizer definitions may be mishandled by the older tokenizer parsers.
  The second reason is that each model may implement differently the tool invocations, e.g. by using different delimiter tokens and different text layouts for describing the parameters of a tool invocation.
  Therefore running the Gemma-4 models encountered various problems during the first days after their release, especially for the dense 31B model.
  Solving these problems required both a new version of llama.cpp (also for other inference backends) and updates in the model chat template and tokenizer configuration files.
  So anyone who wants to use Gemma-4 should update to the latest version of llama.cpp and to the latest models from Huggingface, because the latest updates have been a couple of days ago.
- roosgit 2 months ago
  
  I just hit that error a few minutes ago. I build my llama.cpp from source because I use CUDA on Linux. So I made the mistake of trying to run Gemma4 on an older version I had and I got the same error. It’s possible brew installs an older version which doens’t support Gemma4 yet.
  
  9 replies →

OtherShrezzing 2 months ago

While that might be true, for as long as its name is “.cpp”, people are going to think it’s a C++ library and avoid it.

eterm 2 months ago
This is the first I'm learning that it isn't just a C++ library.
In fact the first line of the wikipedia article is:
> llama.cpp is an open source software library
RobotToaster 2 months ago
It would make sense to just make the GUI a separate project, they could call it llama.gui.
- gettingoverit 2 months ago
  
  It would make even more sense to rename it to ollama, get a copyright for the name, and see how thieves complain they've been robbed :>
- homarp 2 months ago
  
  it is called llama-barn https://github.com/ggml-org/LlamaBarn
  
  2 replies →
figassis 2 months ago

This is correct, and I avoided it for this reason, did not have the bandwidth to get into any cpp rabbit hole so just used whatever seemed to abstract it away.
marssaxman 2 months ago

Wait, it isn't? The name very strongly suggests that it is a text file containing C++ source code; is that not the case?

mijoharas 2 months ago

Frankly I think the cli UX and documentation is still much better for ollama.

It makes a bunch of decisions for you so you don't have to think much to get a model up and running.

zombot 2 months ago

I don't care about the GUI so much. Ollama lets me download, adjust and run a whole bunch of models and they are reasonably fast. Last time I compared it with Llama.cpp, finding out how to download and install models was a pain in Llama.cpp and it was also _much_ slower than Ollama.

throwa356262 2 months ago

That is not true.
If you today visit a models page on huggingface, the site will show you the exact oneliner you need to run to it on llama.cpp.
I didn't measure it, but both download and inference felt faster than ollama. One thing that was definitely better was memory usage, which may be important if you want to run small models on SCB.
anakaine 2 months ago

Having picked it up recently and compared to both llama and lm studio - the models I was using ran faster, used less memory, and had a few extra confif options available that the others hadn't implemented yet but were suggested by the model authors.
It was easy to install, run, and access the gui to get going.