
Comment by aliljet

15 hours ago

If the SWE-bench results are to be believed... this looks best in class right now for a local LLM. To be fair, show me the guy who is running this locally...

It's challenging, but not impossible. With 2-bit quantisation, only about 250 GB of RAM is required. It doesn't all have to be VRAM either; you can mix GPU and CPU inference by offloading some layers to the GPU and keeping the rest in system RAM (see the sketch below).
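
For anyone curious what that looks like in practice, here is a minimal sketch using llama-cpp-python (my assumption about the tooling; the model path and layer count are placeholders you would tune to your hardware):

```python
# Minimal sketch of mixed GPU+CPU inference with llama-cpp-python.
# Assumes a 2-bit GGUF quant of the model is available locally.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q2_K.gguf",  # hypothetical 2-bit quantised GGUF file
    n_gpu_layers=20,               # layers offloaded to VRAM; the rest run on CPU from system RAM
    n_ctx=4096,                    # context window
)

out = llm("Write a haiku about quantisation.", max_tokens=64)
print(out["choices"][0]["text"])
```

The more layers you can fit in VRAM via `n_gpu_layers`, the faster it goes; everything else spills to system RAM.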

In addition, some people on /r/localLlama are having success with streaming the weights off SSD storage at 1 token/second, which is about the rate I get for DeepSeek R1.
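
As I understand it, those SSD setups are mostly just memory-mapped weights: the runtime maps the GGUF file and the OS pages tensors in from disk as they are needed. A hedged sketch of that configuration, again assuming llama-cpp-python:

```python
# Sketch of streaming weights from SSD via mmap (assumed llama-cpp-python setup).
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q2_K.gguf",  # placeholder path on a fast NVMe SSD
    n_gpu_layers=0,                # pure CPU inference; weights are not copied to VRAM
    use_mmap=True,                 # map the file so the OS pages weights in on demand
    use_mlock=False,               # don't pin pages, letting cold weights be evicted
)
```

Throughput is then bounded by how fast the SSD can serve the pages, which is why people report speeds around 1 token/second.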