Comment by MyUltiDev

2 months ago

The attribution and lock-in arguments are the loud parts of this story, but the quieter production reason to move is concurrency. llama.cpp's server takes --parallel N and has continuous batching (--cont-batching) enabled by default, which interleaves tokens from multiple requests inside a single batch and keeps the GPU busy. Ollama defaults its parallel slots low and the interaction is less transparent, so the first time three people share a single model instance you feel it long before any of the ethics become relevant. For a 70B Q4_K_M on a workstation, the real ceiling is KV cache fragmentation, and you have to size the context window around the parallel count rather than around one user. What is the highest parallel value anyone here has kept stable on a 70B Q4_K_M before the cache eviction pattern starts hurting quality?
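
To make the sizing point concrete, here is a rough back-of-envelope sketch, not a measurement. It assumes a Llama-style 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), an f16 KV cache, and a server that splits the total context evenly across slots. The function names are made up for illustration, and the numbers ignore weights, compute buffers, and any fragmentation overhead.

```python
# Back-of-envelope KV cache sizing for a multi-slot inference server.
# All constants are assumptions for a Llama-style 70B; check your model's metadata.
N_LAYERS = 80        # transformer blocks (assumed)
N_KV_HEADS = 8       # KV heads under grouped-query attention (assumed)
HEAD_DIM = 128       # per-head dimension (assumed)
BYTES_PER_ELEM = 2   # f16 cache entries; weight quantization does not change this


def kv_bytes_per_token(n_layers=N_LAYERS, n_kv_heads=N_KV_HEADS,
                       head_dim=HEAD_DIM, bytes_per_elem=BYTES_PER_ELEM):
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def per_slot_context(total_ctx, parallel):
    """Tokens each slot gets if the total context is divided evenly."""
    return total_ctx // parallel


def kv_cache_gib(total_ctx):
    """Total KV cache size in GiB for a given total context length."""
    return kv_bytes_per_token() * total_ctx / 1024 ** 3


if __name__ == "__main__":
    wanted_per_user = 8192  # context each user should actually get
    for parallel in (1, 2, 4, 8):
        total_ctx = wanted_per_user * parallel
        print(f"parallel={parallel} total_ctx={total_ctx} "
              f"per_slot={per_slot_context(total_ctx, parallel)} "
              f"kv_cache={kv_cache_gib(total_ctx):.1f} GiB")
```

Under those assumptions the cache grows linearly with the slot count, which is why the total context has to be chosen as per-user context times parallel rather than set once for a single user.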
