Comment by tmountain

13 hours ago

How well do local OSS models stack up to Claude?

Very well for narrowly scoped purposes.

They decohere much faster as the context grows. Which is fine, or not, depending on whether you consider yourself a software engineer amplifying your output by automating the boilerplate, or an LLM cornac.

Much better than they did half a year ago, but a single RTX 6000 won't get you there.

Models in the 700B+ category (GLM-5, Kimi K2.5) are decent, but running those on your own hardware is a six-figure investment. That's realistic for a company; a private person should instead pick a provider they like from OpenRouter's list of inference providers.

If you really want local on a realistic budget, Qwen 3.5 35B is okay. But it's not anywhere near Claude Opus.

  • > but running those on your own hardware is a six-figure investment

    GLM-5 is a 744B MoE with 40B active. You can run a Q4_K_M quant on llama.cpp if you can afford 512GB of RAM. An RTX 6000 will help a lot with prompt processing, and generation will be relatively fast if you have decent memory bandwidth. llama.cpp's autofit feature is really good at dividing the layers for MoEs to maximize speed when offloading.
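
    Roughly, the launch looks like the sketch below. To be clear about what's assumed: the GGUF filename is made up, and the -ot/--override-tensor flag for pinning the MoE expert tensors to CPU RAM requires a reasonably recent llama.cpp build.

      # Sketch only: the model filename is hypothetical.
      # -ngl 99 sends every layer to the GPU, then -ot overrides that for
      # the per-expert FFN tensors, keeping them in system RAM so the
      # attention and shared layers fit in the RTX 6000's VRAM.
      llama-server -m glm-5-q4_k_m.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 16384

    With the experts held in system RAM, token generation speed mostly tracks your memory bandwidth, while the GPU carries the prompt processing.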