
Comment by brabel

8 hours ago

I want to add an inference engine to my product. I was hoping to use Ollama because, I think, it really helps make sure you have a model with the right metadata that you can count on working (I've seen that with llama.cpp it's easy to get the metadata wrong and start getting rubbish from the LLM because the "stop_token" was wrong or something). I thought Ollama was a proponent of GGUF, which I really like as it standardizes metadata?!

What would be the best way to use llama.cpp and GGUF models these days? Is ramalama a good alternative (I guess it is, but it's not completely clear from your message)? Or should I just use llama.cpp directly, and in that case how do I ensure I don't get rubbish (like the model asking and answering questions by itself without ever stopping)?
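
For what it's worth, the kind of rubbish I mean usually comes down to the EOS/stop token or chat template stored in the GGUF metadata, and that can at least be inspected directly. A rough sketch using the gguf Python package from llama.cpp's gguf-py (the model path is just a placeholder):

  # install llama.cpp's GGUF tooling, which provides the gguf-dump script
  pip install gguf

  # dump the metadata and check the keys that most often cause runaway output
  # when they are wrong: the BOS/EOS token ids and the embedded chat template
  gguf-dump model.gguf | grep -E 'general\.architecture|tokenizer\.ggml\.(bos|eos)_token_id|tokenizer\.chat_template'

If tokenizer.chat_template is present, passing --jinja to llama-server should make it use that embedded template, which as I understand it is usually what stops the model from chatting with itself.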

Meant to say llama-swap instead of llama-server. llama-swap adds a GUI and dynamic model switching on top of llama-server. It's somewhat tricky to set up, since it relies on a YAML config file that is poorly documented for use with Docker, but it looks something like this:

  "GLM4-Air":
    env:
      - "CUDA_VISIBLE_DEVICES=1"
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --cache-type-k q8_0 --cache-type-v q8_0
      --flash-attn
      --ctx-size 32684
      --jinja
      -ngl 20
      --model /modelfiles/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf
      --port 9999

When run via Docker this gets you a setup similar to Ollama's. The YAML file also needs a TTL set if you want it to unload models after an idle period.
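
If it helps, the idle unload is a per-model ttl value in seconds in that same file; a minimal sketch of where it goes, assuming the entry above (check the llama-swap README for the current key name):

  models:
    "GLM4-Air":
      ttl: 300   # seconds of idle time before llama-swap unloads the model
      # ...rest of the entry (env, proxy, cmd) as above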

Ollama's native models in their marketplace supposedly have these params set correctly, to save you having to do this config yourself, but in practice it's hit or miss, and they often change from day 0 of a release.
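
If you do go the Ollama route, you can at least see what it will actually use: ollama show can print the generated Modelfile, including the stop parameters and template. A small sketch (the model name is just an example):

  # print the Modelfile for a pulled model, including its
  # PARAMETER stop lines and the TEMPLATE that will be applied
  ollama show --modelfile llama3.1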