Comment by omneity
8 hours ago
This is a great project. FYI, all you need is the size of the LLM plus the memory amount & bandwidth to know whether it fits and roughly what tok/s to expect.
It’s a simple formula:
llm_size = number of params * size_of_param
So a 32B model in 4-bit (32B params × 0.5 bytes/param) needs a minimum of 16GB of RAM just to load the weights.
Then you calculate
tok_per_s = memory_bandwidth / llm_size
An RTX 3090 has 960GB/s of memory bandwidth, so a 32B model (16GB VRAM) will produce 960/16 = 60 tok/s
For an MoE the speed is mostly determined by the amount of active params not the total LLM size.
Add a 10% margin to those figures to account for a number of details, but that’s roughly it. RAM use also increases with context window size.
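The rule of thumb above can be sketched in a few lines (same example numbers as in the comment, with the 10% margin folded in):

```python
# Back-of-envelope estimate of model footprint and decode speed.
# The GPU and model numbers below are just the example values from the comment.

def llm_size_gb(n_params_b: float, bits_per_param: float) -> float:
    """Model size in GB: number of params * size of each param."""
    return n_params_b * bits_per_param / 8  # billions of params * bytes/param ~= GB

def tok_per_s(memory_bandwidth_gbs: float, size_gb: float, margin: float = 0.10) -> float:
    """Decode speed estimate: bandwidth / model size, minus a ~10% margin."""
    return (memory_bandwidth_gbs / size_gb) * (1 - margin)

size = llm_size_gb(32, 4)      # 32B model at 4-bit -> 16.0 GB
speed = tok_per_s(960, size)   # 960 GB/s bandwidth -> ~54 tok/s after margin
print(size, round(speed, 1))
```

For an MoE, pass the active parameter count to `tok_per_s` (speed) but the total count to `llm_size_gb` (fit).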
> RAM use also increases with context window size.
The KV cache is very swappable since only a small amount of it is written per generated token (whereas swapping weights would mean streaming as much as llm_active_size per token, which is way too much at scale!), so it may be possible to support long contexts with quite acceptable performance while still saving RAM.
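To put numbers on both points (cache size, and how little is written per token), here is a quick sketch using the standard KV cache size formula; the 32B-class model config (64 layers, 8 KV heads via GQA, head dim 128) is a hypothetical example, not taken from the thread:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Total KV cache size in GB at fp16 (bytes_per_elem=2)."""
    # 2x for K and V; one K and one V vector per layer per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len / 1e9

# Hypothetical 32B-class config at a 32k context:
total = kv_cache_gb(64, 8, 128, 32_768)
print(round(total, 2))  # ~8.59 GB of cache overall...
# ...but each generated token only *writes* per_token_bytes (~256 KB here),
# which is why swapping the cache is cheap compared to swapping weights.
```

Note the growth is linear in context length, so doubling the window doubles this figure.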
Make sure also that you're using mmap to load model parameters, especially for MoE experts. It has no detrimental effect on performance given that you have enough RAM to begin with, but it allows you to scale up gradually beyond that, at a very limited initial cost (you're only replacing a fraction of your memory_bandwidth with much lower storage_bandwidth).
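A minimal sketch of the mmap idea in Python; the filename and layout are assumptions for illustration, the point is that mapped pages are only faulted in when touched:

```python
import mmap
import os

path = "model.bin"  # hypothetical weights file, created here just for the demo
with open(path, "wb") as f:
    f.write(os.urandom(4096))

with open(path, "rb") as f:
    # Map the file read-only instead of reading it all into RAM.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The OS faults pages in lazily on first access; experts that are never
    # activated (in an MoE) never consume physical RAM.
    first_page = mm[:4096]
    mm.close()

os.remove(path)
print(len(first_page))
```

Real inference engines (e.g. llama.cpp) do the same thing at the weights-file level, which is what makes gradual scaling past physical RAM possible.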
Well, mmap can still cause issues if you run short on RAM: the resulting disk access adds latency and drags down overall performance. It's better than nothing though.
Agree that the k/v cache is underutilized by most folks. Ollama disables Flash Attention by default, so you need to enable it. Also, Ollama's default quantization for the k/v cache is fp16; you can drop to q8_0 in most cases. (https://mitjamartini.com/posts/ollama-kv-cache-quantization/) (https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...)
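Concretely, per the linked posts this comes down to two environment variables on the Ollama server (check that your Ollama version supports them):

```shell
# Enable Flash Attention (off by default) and drop the k/v cache to q8_0.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```

Flash Attention must be on for the k/v quantization setting to take effect.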
This is a good rule of thumb. I would also include that RAM use grows linearly with context window size, since the KV cache adds a fixed amount per token.