Comment by AYBABTME

11 hours ago

Right now, this is making the case for OSS AI and local inference. Paying $200/mo only to get rate limited makes an RTX 6000 Pro look cheap.
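As a rough sanity check on that claim, here is the break-even arithmetic; the GPU price is an assumption for illustration, not a quoted figure:

```python
# Months of a $200/mo subscription needed to equal a one-time GPU purchase.
# The $8,000 price tag for an RTX 6000 Pro is an assumption.
def breakeven_months(gpu_price: float, monthly_sub: float) -> float:
    return gpu_price / monthly_sub

print(f"{breakeven_months(8000, 200):.0f} months")  # prints "40 months"
```

Under those assumptions the card pays for itself in a bit over three years, ignoring electricity and depreciation.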

How well do local OSS models stack up to Claude?

  • Very well for narrowly scoped purposes.

    They decohere much faster as the context grows, which is fine or not, depending on whether you consider yourself a software engineer amplifying your output by automating the boilerplate, or an LLM cornac.

  • Much better than they did half a year ago, but a single RTX 6000 won't get you there.

    Models in the 700B+ category (GLM-5, Kimi K2.5) are decent, but running those on your own hardware is a six-figure investment. That's realistic for a company; as a private person, instead pick a provider you like from OpenRouter's list of inference providers.

    If you really want local on a realistic budget, Qwen 3.5 35B is OK, but it's nowhere near Claude Opus.

    • > but running those on your own hardware is a six-figure investment

      GLM-5 is a 744B MoE with 40B active. You can run a Q4_K_M quant on llama.cpp if you can afford 512GB of RAM. An RTX 6000 will help a lot with prompt processing, and generation will be relatively fast if you have decent memory bandwidth. llama.cpp's autofit feature is really good at dividing the layers for MoEs to maximize speed when offloading.
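
      A minimal sketch of such a setup with llama.cpp's server; the model path, context size, and port are illustrative assumptions, and exact flags vary by llama.cpp version:

      ```shell
      # Serve a Q4_K_M GGUF quant, offloading as many layers as fit on the GPU
      # while the rest of the MoE weights stay in system RAM.
      ./llama-server \
        -m models/GLM-5-Q4_K_M.gguf \
        -c 32768 \
        --n-gpu-layers 999 \
        --host 127.0.0.1 --port 8080
      ```

      With `--n-gpu-layers` set higher than the model's layer count, llama.cpp offloads whatever fits in VRAM and keeps the remainder on the CPU side.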

What’s the depreciation on that RTX 6000 though?

New hardware keeps on coming with large gains in performance.

  • Does it? The market looks like it'll be harder for consumers to get such hardware for the time being. An RTX 6000 might appreciate instead of depreciating.

    • > Does it? Market looks like it'll be harder for consumers

      Yes. I wasn't specifically talking about consumers only, though.