Comment by samus
1 day ago
There aren't any because it depends a lot on what your use case is, what speed you expect, how accurate you need it to be, how many users it has to serve, and how much context you need.
- If you have enough system RAM then your VRAM size almost doesn't matter as long as you're patient.
- For most models, running at 16-bit precision is a waste unless you're fine-tuning. The difference to Q8 is negligible, and Q6 is still very faithful. In return, quantized models need less memory and run faster.
- Users obviously need to share computing resources with each other. If this is a concern, then you need, at a minimum, enough GPUs to ensure the whole model fits in VRAM, otherwise all the loading and unloading will royally screw up performance.
- Maximum context length is crucial to think about, since the context (the KV cache) has to be stored in memory as well, preferably in VRAM. The number of concurrent users therefore plays a role in how much context you can offer each of them. It's also possible to offload the cache to system RAM or to quantize it; a rough sizing sketch follows below.
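
To see why context eats memory so quickly, here's a minimal back-of-the-envelope sketch. The architecture numbers (32 layers, 8 KV heads, head dim 128, roughly an 8B-class model) and the function name are assumptions for illustration, not something from the comment; plug in the real values for whatever model you run.

```python
# Back-of-the-envelope KV-cache size for a transformer.
# Architecture defaults below are assumed (roughly 8B-class).

def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2, batch=1):
    """Keys + values, stored per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len * batch

# One user at 8k context, fp16 cache: ~1 GiB with these numbers.
print(kv_cache_bytes(8192) / 2**30)                      # ~1.0
# Same context with the cache quantized to 8-bit: half of that.
print(kv_cache_bytes(8192, bytes_per_value=1) / 2**30)   # ~0.5
# Four concurrent users at 8k each: the cache alone is ~4 GiB.
print(kv_cache_bytes(8192, batch=4) / 2**30)             # ~4.0
```

The cache grows linearly with both context length and concurrent users, which is why the context budget can matter as much as the weights themselves.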
Rule of thumb: budget 1.5×s, where s is the model size at the quantization level you're using. By that measure an 8B model at Q8 should be a good fit for a 12GB card, which is one of the main reasons this is such a common size class of LLMs.
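
A minimal sketch of that rule of thumb, assuming nominal bits-per-weight values for the common quantization levels (real quant formats carry a bit of extra overhead for scales, so treat these as rough figures):

```python
# The 1.5*s rule of thumb: model weights at the chosen quantization,
# plus ~50% headroom for context, activations, and runtime overhead.
# Bits-per-weight values are nominal assumptions, not exact figures.

BITS_PER_WEIGHT = {"fp16": 16, "q8": 8, "q6": 6.5, "q4": 4.5}

def vram_budget_gb(params_billion, quant):
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return 1.5 * weights_gb

print(vram_budget_gb(8, "q8"))    # 12.0  -> the 8B-on-a-12GB-card case
print(vram_budget_gb(8, "q6"))    # 9.75  -> leaves room for more context
print(vram_budget_gb(8, "fp16"))  # 24.0  -> why 16-bit is usually a waste
```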