
Comment by llmthrowaway

18 hours ago

Confusing title - I thought this was about Ollama finally supporting sharded GGUF (i.e. the Hugging Face default for large GGUFs over 48GB).

https://github.com/ollama/ollama/issues/5245

Sadly it is not, and the issue remains open after more than a year, meaning Ollama cannot run the latest SOTA open-source models unless they convert them to their proprietary format, which they do not do consistently.

No surprise I guess, given they've taken VC money, refuse to properly attribute their use of things like llama.cpp and ggml, have their own model format for... reasons? And have over 1800 open issues...

llama-server, ramalama, or whatever model switcher ggerganov is working on (he showed previews recently) feels like the way forward.

I want to add an inference engine to my product. I was hoping to use Ollama because, I think, it really helps make sure you have a model with the right metadata that you can count on working (with llama.cpp I've seen that it's easy to get the metadata wrong and start getting rubbish from the LLM because the "stop_token" was wrong or something). I'd thought Ollama was a proponent of GGUF, which I really like since it standardizes the metadata?!

What would be the best way to use llama.cpp and GGUF models these days? Is ramalama a good alternative (I guess it is, but it's not completely clear from your message)? Or should I just use llama.cpp directly, and in that case how do I ensure I don't get rubbish (like the model asking and answering questions by itself without ever stopping)?
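
For the "use llama.cpp directly" route, a minimal llama-server setup looks roughly like this (a sketch with placeholder paths, not a definitive setup; the --jinja flag tells the server to use the chat template embedded in the GGUF metadata, which is usually what's wrong when a model never stops or starts talking to itself):

    # serve a GGUF with its embedded chat template applied server-side
    ./llama-server \
      --model ./models/some-model.gguf \
      --ctx-size 8192 \
      --jinja \
      -ngl 99 \
      --port 8080

    # then hit the OpenAI-compatible endpoint; the server does the prompt
    # formatting and stop handling for you instead of your client code
    curl http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"Hello"}]}'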

  • Meant to say llama-swap instead of llama-server. llama-swap adds a GUI and dynamic model switching on top of llama-server. It's somewhat tricky to set up, as it relies on a YAML config file that is poorly documented for use with Docker, but something like:

      "GLM4-Air":
        env:
          - "CUDA_VISIBLE_DEVICES=1"
        proxy: "http://127.0.0.1:9999"
        cmd: >
          /app/llama-server
          --cache-type-k q8_0 --cache-type-v q8_0
          --flash-attn
          --ctx-size 32684
          --jinja
          -ngl 20
          --model /modelfiles/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf
          --port 9999
    

    When run via Docker this gets you a setup similar to Ollama. The YAML file also needs a TTL set if you want it to unload models after an idle period (see the sketch below).
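
    A hedged sketch of both pieces (the per-model ttl value is in seconds; the image name, port, and container paths are assumptions, so check them against the llama-swap README before relying on this):

      # in config.yaml: unload the model after 5 minutes with no requests
      models:
        "GLM4-Air":
          ttl: 300
          # ... same env/proxy/cmd as above ...

      # run llama-swap itself in Docker, mounting the config and the model directory
      docker run --gpus all \
        -p 8080:8080 \
        -v /path/to/config.yaml:/app/config.yaml \
        -v /modelfiles:/modelfiles \
        ghcr.io/mostlygeek/llama-swap:cuda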

    Ollama's native models in their marketplace supposedly have these params set correctly to save you from doing this config yourself, but in practice it's hit or miss, and the params often get changed after day 0 of a release.