Comment by bityard

7 hours ago

There are infinite combinations of CPU/GPU capable of running LLMs locally. What most people do is buy the system they can afford that roughly meets their goals, then ball-park VRAM usage from the model size and quantization.
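That ball-park calculation is simple enough to do yourself: weights take roughly (parameter count × bits per weight / 8) bytes, plus some runtime overhead. A minimal sketch, where the overhead fraction is an assumption, not a measured value:

```python
# Rough VRAM ball-park: params * bytes-per-weight, plus a fudge factor
# for runtime buffers. All numbers are illustrative assumptions.
def ballpark_vram_gb(params_b: float, bits_per_weight: float,
                     overhead_frac: float = 0.1) -> float:
    """Estimate VRAM (GB) for model weights at a given quantization.

    params_b: parameter count in billions (e.g. 27 for a 27B model)
    bits_per_weight: e.g. 16 (fp16), 8 (Q8), ~4.5 (a 4-bit K-quant)
    overhead_frac: assumed fudge factor for runtime buffers
    """
    weight_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb * (1 + overhead_frac)

# A 27B model at ~4.5 bits/weight needs roughly 16-17 GB for weights
# alone, before the KV cache and context are added.
print(round(ballpark_vram_gb(27, 4.5), 1))  # → 16.7
```

This only covers the weights; the KV cache grows with context length on top of it.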

For a more detailed analysis, there are several online VRAM calculators. Here's one: https://smcleod.net/vram-estimator/

If you have a Hugging Face account, you can set your system configuration and you'll get little icons next to each quant in the sidebar (green: will likely fit; yellow: tight fit; red: will not fit).

Further, tokens per second (t/s) depends greatly on many different factors; the best you might get is a guess based on context size.

One thing about running local LLMs right now is that there are tradeoffs literally everywhere, and you have to choose what to optimize for down to the individual task.

These calculators are almost entirely useless. They don't understand specific model architectures. Even the ones that try to support only specific models (like the apxml one) get it very wrong a lot of the time.

For example, the one you linked, when given a Qwen3.5 27B Q_4_M GGUF [0], says it will require 338 GB of memory with a 16-bit KV cache. That is off by more than an order of magnitude.

[0] https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/resol...
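A quick sanity check shows why 338 GB can't be right. KV cache memory is 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. The config below is a hypothetical GQA layout typical of a 27B-class model, not the actual Qwen architecture; it's only meant to show the order of magnitude:

```python
# Sanity-check KV-cache memory with the standard formula:
#   2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_elem
# The parameters below are assumed, illustrative values for a 27B-class
# GQA model, NOT the real Qwen config.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    total = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1e9

# e.g. 48 layers, 8 KV heads, head dim 128, 32k context, 16-bit cache:
print(round(kv_cache_gb(48, 8, 128, 32768), 1))  # → 6.4
```

A few GB of KV cache plus ~16 GB of Q4 weights lands in the 20-25 GB range, nowhere near 338 GB.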

Just ask Claude to install the optimal model with a nice chat UI tailored to your wishes. 15 minutes max.