Comment by adeon

3 months ago

In my case, just CPU (it's a Hetzner server; /proc/cpuinfo reports "AMD EPYC 9454P 48-Core Processor"). I apparently still had some stats in my terminal backlog, so I've pasted them below.

It's not a speed demon, but it's enough to mess around and test things out. The thinking phase can sometimes run pretty long, so it can take a while to get responses, even if ~6 tokens/sec is pretty good for a pure-CPU setup.

---

    prompt eval time =     133.55 ms /     1 tokens (  133.55 ms per token,     7.49 tokens per second)
           eval time =  392205.46 ms /  2220 tokens (  176.67 ms per token,     5.66 tokens per second)
          total time =  392339.02 ms /  2221 tokens

And my exact command was:

    llama-server --model DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf --temp 0.6 -c 9000 --min-p 0.1 --top-k 0 --top-p 1 --timeout 3600 --slot-save-path ~/llama_kv_path --port 8117 -ctk q8_0

(IIRC the slot save path argument does absolutely nothing unless you actually use the slot-saving feature, so it's superfluous here, but I have been pasting a similar command around and have been too lazy to remove it.) -ctk q8_0 quantizes the K cache to q8_0, which reduces memory use a bit for context.

I think my 256 GB of RAM is right at the limit, spilling a bit into swap, so I'm pushing the limits :)
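
If you want to check whether it's actually spilling on your own box, stock Linux tools are enough; something like this while the model is generating:

    free -h       # total/used/free memory plus current swap usage
    vmstat 1 5    # the si/so columns show swap-in/swap-out activity per second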

The --min-p 0.1 was a recommendation from the Unsloth page; I think it's because the quant goes so low in bits that some things may start to misbehave, and min-p is a mitigation. But I haven't messed around enough to say how true that is, or any nuance about it. I think I put --temp 0.6 for the same reason.

To explain for anyone not aware of llama-server: it exposes a (somewhat) OpenAI-compatible API, so you can use it with any software that speaks that API. llama-server itself also has a UI, but I haven't used it.
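
For example, you can hit it with plain curl (a minimal sketch; port 8117 matches the command above, and the "model" value is an arbitrary placeholder, since llama-server serves whatever single model it was started with):

    curl http://localhost:8117/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "deepseek-r1",
            "messages": [{"role": "user", "content": "Hello!"}],
            "temperature": 0.6
          }'

Any OpenAI client library pointed at http://localhost:8117/v1 works the same way.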

I had some SSH tunnels set up to use the server interface with https://github.com/oobabooga/text-generation-webui, where I hacked an "OpenAI" client into it (that UI doesn't have one natively). The only reason I use the oobabooga UI is habit, so I don't recommend this setup to others. The tunnel itself is just a standard local port forward, sketched below.
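
Something like this (the hostname is a placeholder, and 8117 again matches the --port above):

    # Forward local port 8117 to the same port on the server, so local software
    # can talk to llama-server at http://localhost:8117 as if it were running here.
    ssh -L 8117:localhost:8117 user@my-hetzner-server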