
Comment by llmthrowaway

1 day ago

Meant to say llama-swap instead of llama-server. llama-swap adds a GUI and dynamic model switching on top of llama-server. It's somewhat tricky to set up because it relies on a YAML config file that is poorly documented for use with Docker, but it looks something like:

  "GLM4-Air":
    env:
      - "CUDA_VISIBLE_DEVICES=1"
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --cache-type-k q8_0 --cache-type-v q8_0
      --flash-attn
      --ctx-size 32684
      --jinja
      -ngl 20
      --model /modelfiles/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf
      --port 9999
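
Once it's running, any OpenAI-compatible client pointed at llama-swap triggers the switching: it matches the "model" field in the request against the keys in the YAML and spawns (or swaps in) the corresponding llama-server. A quick smoke test, assuming llama-swap is listening on its default port 8080:

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "GLM4-Air", "messages": [{"role": "user", "content": "hello"}]}'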

When run via Docker this gets you a setup similar to Ollama. The YAML file also needs a per-model ttl set if you want it to unload models after an idle period.
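
The ttl is in seconds; a sketch, extending the entry above:

  models:
    "GLM4-Air":
      ttl: 300   # unload after 5 minutes idle
      # ...rest of the entry as above...

And a typical Docker invocation, mounting the config and the model directory; the mount paths and image tag here are illustrative, so check the llama-swap README for the current ones:

  docker run -it --rm --gpus all \
    -p 8080:8080 \
    -v /path/to/config.yaml:/app/config.yaml \
    -v /path/to/models:/modelfiles \
    ghcr.io/mostlygeek/llama-swap:cuda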

Ollama's native models in their marketplace supposedly have these parameters set correctly, saving you from doing this config yourself, but in practice this is hit or miss, and the settings often change from day 0 of a release.