Comment by quacker

1 month ago

I could have used this article before I spent the weekend arriving at the same conclusion!

Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.

GPT OSS 20B was usable but slow, and actually frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.

Qwen 3.5 9B with Opencode was much faster and actually able to work through a majority of the lint warnings without getting stuck, even through compaction, and it fixed every warning with a correct edit.

I tried 4-bit MLX quants of Qwen 3.5 9B, but they would eventually crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.

It is absolutely not comparable to frontier models. It’s way slower, gets basic info wrong, and really can’t handle non-trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn’t present anywhere in the repo. So YMMV, but it’s still nice to have, and hopefully the local LLM story can get much better on modest hardware over time.

> It is absolutely not comparable to frontier models.

This is not said often enough.

Yes, local LLMs are great! But reading most HN posts on the subject, you'd think they're within reach of Opus 4.7.

There is a very small, very vocal, very passionate crowd that dramatically overstates the capabilities of local LLMs on HN.

  • Very different from my experience. Gemma 31b just solved a physics problem Opus 4.7 gave up on. I definitely don't think they're equivalent in general. Opus for sure is way smarter and way more likely to get things right on the edge, but it's still quite likely to get things wrong too, so that doesn't make it that much more useful for a lot of stuff. Conversely, there are so many things you would use an LLM for that they will both reliably one-shot. Especially in agentic mode, where you have ground truth feedback between turns, the difference gets quite small for a lot of tasks.

    That all being said, I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years, so I don't see a lot of the rough edges. I really believe the capability is there. Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to one-shot, Qwen 3.6 35B MoE will handle at like 90 tok/sec, which is absolutely fantastic for tasks that don't require a huge amount of precision.

  • At least in my experience, local models are very far away from models like Opus 4.7 or ChatGPT 5.5 in coding and problem solving areas.

    I find them useful for basic research, learning, and question-asking tasks. Although at the same time, reading a Wikipedia page or doing a few Google searches could likely accomplish the same, and has been able to for decades.

    • I think you're doing it wrong. Use the frontier models for the research, planning, etc., and once you have a plan, give it to a local model for implementation.

  • This.

    I have seen way too many people who are overly optimistic about local LLMs.

    Having spent a decent amount of time playing with them on consumer Nvidia GPUs, I understand well that they are not going to be widely usable any time soon. Unfortunately, not many people share that view.

    • Not this. Let's reframe the problem. How many years behind do you think they are? By all accounts Gemma 4 is better than a frontier model from 3 years ago. Back then we were wowed by frontier models but when the local model reaches the same performance it's no good anymore, because you moved the target?

      Relatively speaking local models might always be behind the curve compared to frontier ones. You can tell by the hardware needed to run each. But in absolute terms they're already past the performance threshold everyone praised in the past.

      Right now in a lab somewhere there's a model that's probably better than anything else. There's a ChatGPT 5.6, an Opus 4.8. Knowing that, do you suddenly feel a wave of disappointment at the current frontier models?

    • So the cofounder of Hugging Face made a post about Qwen 3.6 being at Claude level of performance for the lols?

      When were you trying local models? The model releases from April 2026 are a serious change in performance.

  • That's totally fine and dandy as there is a very big, very vocal, very brainwashed crowd that dramatically overstates the capabilities of remote LLMs on HN as well.

  • You are missing context.

    A local model is as good as a frontier model at responding in a Signal thread with you, which requires basic tool calling.

    A local model is as good as a frontier model at writing a joke.

    A local model is as good as a frontier model at responding to an email.

    Not sure what needs to be said often enough. No one without a clue would play around with a local model setup and completely ignore frontier models and their capabilities?!

  • I'm like 50% convinced that these people are paid by Apple to promote their products. Because the conversation is always just about being able to execute models (even larger ones) on Mac hardware with unified memory, but nobody ever mentions that inference speed is unusably slow.

    You can have good local LLM performance through agents, but you need fast inference. Generally, 2x 3090s, or at minimum 2x 3080s (you need 2 to speed up prefill processing to build the KV cache). Ironically, you just need to be good at prompt engineering, which has a close real-world analogue in managing low-skilled people through completing tasks.

Honestly surprised to hear that GPT OSS 20B runs slow on Mac hardware. It's absolutely one of the fastest models I've run on local GPUs for its size, but I've only tried Nvidia cards.

Edit: TIL it is MoE and only has 3.6B active, explains a lot.

  • Yeah, I'm probably wrong there. GPT OSS 20B is certainly much faster than some other models I've tried. I actually gave GPT OSS 20B a few prompts just now and it seems to respond as fast as or faster than Qwen 3.5 9B. But I needed many more prompts for GPT OSS 20B to complete my contrived task, so progress felt much slower.