
Comment by segmondy

21 hours ago

llama.cpp is designed for partial offloading: the most important parts of the model are loaded into GPU VRAM and the rest into system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.
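A minimal sketch of what this looks like in practice: llama.cpp exposes the `-ngl` (`--n-gpu-layers`) flag to control how many transformer layers are offloaded to the GPU, with the remaining layers kept in system RAM. The model path and layer count below are placeholders, not a specific recommendation.

```shell
# Offload 30 layers to the GPU; the rest of the model stays in system RAM.
# model.gguf is a placeholder path to a quantized GGUF file.
llama-cli -m model.gguf -ngl 30 -p "Hello"
```

Setting `-ngl` to a very large number (e.g. 99) offloads every layer if it fits; lowering it trades speed for a smaller VRAM footprint.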

How much do you use?

I have a lot of trouble figuring out what the limits are for a system with x amount of VRAM and y amount of RAM. How do you determine this?
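A rough rule of thumb, not an exact answer: the model's layers are approximately equal in size, so dividing the GGUF file size by the layer count gives a per-layer cost, and the number of layers that fit in VRAM follows from that. This sketch ignores KV-cache and context-length overhead (which grow with context size), so treat the result as an upper bound; the function name and the overhead value are assumptions for illustration.

```python
def estimate_gpu_layers(model_size_gb: float, n_layers: int,
                        vram_gb: float, overhead_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming roughly
    equal-sized layers and a fixed overhead for buffers/context.
    Ignores KV cache growth, so this is an optimistic upper bound."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Example: a 40 GB quantized model with 80 layers on a 24 GB GPU
print(estimate_gpu_layers(40, 80, 24))  # → 45
```

The remaining layers (here 35 of 80) would need to fit in system RAM alongside the OS; in practice people start from an estimate like this and then adjust `-ngl` down until the model loads without out-of-memory errors.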